CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains >800K CSV files behind the GitTables 1M corpus.
For more information about the GitTables corpus, visit:
- our website for GitTables, or
License: http://www.gnu.org/licenses/lgpl-3.0.html
On the official website the dataset is available through a SQL Server instance (localhost) and as CSVs to be used with Power BI Desktop running in the Virtual Lab (virtual machine). The first two steps of importing data were executed in the virtual lab, the resulting Power BI tables were exported to CSVs, and records up to the year 2022 were added as required.
This dataset is helpful if you want to work offline with Adventure Works data in Power BI Desktop in order to follow the lab instructions in the training material on the official website, such as the Power BI Desktop Sales Analysis example from Microsoft's PL-300 learning path.
Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two steps of importing data, as described in the PL-300 Microsoft Power BI Data Analyst exam lab.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
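For readers who want a sense of the streaming-parse pattern the iterparse scripts rely on without opening the repository, the sketch below shows the general approach; it is not the author's code, and the record tag and field names are placeholders rather than CIPO's actual schema.

```python
# Illustrative only: streaming a large XML archive into a CSV without loading
# it all into memory, in the spirit of the repository's "iterparse" scripts.
# The record tag and column names below are hypothetical placeholders.
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path, record_tag="Application",
               fields=("applicationNumber", "filingDate")):
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        # iterparse yields elements as they are read, so memory use stays bounded
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag.endswith(record_tag):
                writer.writerow([elem.findtext(f) for f in fields])
                elem.clear()  # release the parsed subtree

if __name__ == "__main__":
    xml_to_csv("XML_raw/sample.xml", "csv/sample.csv")
```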
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Legal Studies 908 (2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
The Python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
The UK House Price Index is a National Statistic.
Download the full UK House Price Index data below, or use our tool to create your own bespoke reports: https://landregistry.data.gov.uk/app/ukhpi
Datasets are available as CSV files. Find out about republishing and making use of the data.
Google Chrome is blocking downloads of our UK HPI data files (Chrome 88 onwards). Please use another internet browser while we resolve this issue. We apologise for any inconvenience caused.
This file includes a derived back series for the new UK HPI. Under the UK HPI, data is available from 1995 for England and Wales, 2004 for Scotland and 2005 for Northern Ireland. A longer back series has been derived by using the historic path of the Office for National Statistics HPI to construct a series back to 1968.
Download the full UK HPI background file:
If you are interested in a specific attribute, we have separated them into these CSV files:
- Average price (CSV, 9.4MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Average-prices-2023-11.csv
- Average price by property type (CSV, 28.2MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Average-prices-Property-Type-2023-11.csv
- Sales (CSV, 4.9MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Sales-2023-11.csv
- Cash mortgage sales (CSV, 6.9MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Cash-mortgage-sales-2023-11.csv
- First time buyer and former owner occupier (CSV, 6.6MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/First-Time-Buyer-Former-Owner-Occupied-2023-11.csv
- New build and existing resold property (CSV, 17.2MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/New-and-Old-2023-11.csv
- Index (CSV, 6.1MB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-2023-11.csv
- Index seasonally adjusted (CSV, 210KB): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Indices-seasonally-adjusted-2023-11.csv
- Average price seasonally adjusted (CSV): http://publicdata.landregistry.gov.uk/market-trend-data/house-price-index-data/Average-price-seasonally-adjusted-2023-11.csv
This started as a project that required course data from Coursera. Due to the lack of such a dataset on common forums, I decided to scrape coursera.org and coursera.org/directory for a list of all Coursera courses.
The first file contains a list of course names and URLs. The second CSV file contains details of each course.
The script to generate/webscrape this dataset directly from coursera can be found at https://github.com/mihirs16/Coursera-Web-Scraper.
All data scraped is solely owned by Coursera.org and is meant to be used only for educational and experimental purposes.
License: https://creativecommons.org/publicdomain/zero/1.0/
By data.gov.ie [source]
This dataset contains data from the East Atlantic SWAN Wave Model, which is a powerful model developed to predict wave parameters in Irish waters. The output features of the model include Significant Wave Height (m), Mean Wave Direction (degreesTrue) and Mean Wave Period (seconds). These predictions are generated with NCEP GFS wind forcing and FNMOC Wave Watch 3 data as boundaries for the wave generation.
The accuracy of this model is important for safety-critical applications as well as research efforts into understanding changes in tides, currents, and sea levels. Users are provided with up-to-date predictions for the previous 30 days and 6 days into the future, with download service options that allow selection by date/time, a single parameter, and output file type.
Data providers released this dataset under a Creative Commons Attribution 4.0 license on 2017-09-14. It can be used free of charge within certain restrictions set out by its respective author or publisher.
Introduction:
Step 1: Acquire the Dataset:
The first step is getting access to the dataset, which is free of cost. The original source of this data is http://wwave2.marinecstl.org/archive/index?cat=model_height&xsl=download-csv-1. You can also get this data by downloading it as a CSV file from Kaggle (https://www.kaggle.com/marinecstl/east-atlantic-swan-wave-model). The download should contain seven columns of various parameters; time, latitude, longitude, and significant wave height are the most important ones to be familiar with before using this dataset effectively in any project.
Step 2: Understand Data Columns & Parameters:
Now that you have downloaded the data, it is time to understand what each column represents and how the columns relate to each other when comparing data from different locations within one country or across countries. Time holds the timestamp of each observation, taken at the exact location specified by the latitude and longitude columns; latitude ranges between -90° and +90°, with higher values closer to the North Pole and lower values closer to the South Pole. Significant wave height represents the displacement of the ocean surface caused by short-period variations such as tides, wind-forced waves, or more extreme conditions like ocean storms.
Step 3: Understand Data Limitations & Apply Exclusion Criteria:
Because the model runs every day across various geographical regions, some inaccuracy in the predicted values for any given time slot is inevitable. Users should therefore apply suitable criteria during the analysis phase, taking into account natural limitations such as current weather conditions and water depth when working with readings for particular timestamps, whether obtained from the CSV file or via API services. Also remember that these predictions may not be used for safety-critical purposes.
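Before moving on to the use cases below, a quick pandas check of the download can confirm the column names and catch obviously implausible values; this is only a sketch, and the column names used here are assumptions to be adjusted against the actual header of download-csv-1.csv.

```python
# Minimal sketch for inspecting the SWAN wave model CSV with pandas.
# The column name "significant_wave_height" is an assumption; check the
# header of your download and adjust accordingly.
import pandas as pd

df = pd.read_csv("download-csv-1.csv")
print(df.columns.tolist())        # confirm the actual column names
print(df.head())

# Example exclusion criterion: drop rows with missing or implausible heights
if "significant_wave_height" in df.columns:
    df = df[df["significant_wave_height"].between(0, 30)]
    print(df["significant_wave_height"].describe())
```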
- Visualizing wave heights in the East Atlantic area over time to map oceanic currents.
- Finding areas of high-wave activity: using this data, researchers can identify unique areas that experience particularly severe waves, which could be essential to know for protecting maritime vessels and informing navigation strategies.
- Predicting future wave behavior: by analyzing current and past trends in SWAN Wave Model data, scientists can predict how significant wave heights will change over future timescales in the studied area.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: download-csv-1.csv | Column name | Descrip...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is our complete database in CSV format (with gene names, IDs, annotations, lengths, cluster sizes, and taxonomic classifications) that can be queried on our website. The difference is that it does not have the sequences; those can be downloaded in other files on figshare. This file, as well as those, can be parsed and linked by the gene identifier. We recommend downloading this database and parsing it yourself if you attempt to run a query that is too large for our servers to handle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update Frequency: "An automated scan for dataset updates occurs every day at 3:45 a.m."
For up-to-date information on service requests, please visit https://city.milwaukee.gov/ucc.
A log of the Unified Call Center's service requests.
From potholes, abandoned vehicles, high weeds on vacant lots, and curbside trash to faulty traffic signals, the City of Milwaukee's Unified Call Center (UCC) makes it easy to submit service requests to solve problems. The UCC also allows you to track your service requests. Each time you complete a service request online, you will be assigned a tracking number that you can use to see when a City of Milwaukee representative expects to investigate or take care of your request.
To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About NTPS
The National Teacher and Principal Survey (NTPS) is a system of related questionnaires that provide descriptive data on the context of elementary and secondary education while also giving policymakers a variety of statistics on the condition of education in the United States. The NTPS is a redesign of the Schools and Staffing Survey (SASS), which the National Center for Education Statistics (NCES) conducted from 1987 to 2011. The design of the NTPS is a product of three key goals coming out of the SASS program: flexibility, timeliness, and integration with other Department of Education collections. The NTPS collects data on core topics including teacher and principal preparation, classes taught, school characteristics, and demographics of the teacher and principal labor force every two to three years. In addition, each administration of NTPS contains rotating modules on important education topics such as professional development, working conditions, and evaluation. This approach allows policymakers and researchers to assess trends on both stable and dynamic topics.
Data Organization
Each table has an associated Excel file and Excel SE file, which are grouped together in a folder in the dataset (one folder per table). The folders are named based on the Excel file names, as they were when downloaded from the National Center for Education Statistics (NCES) website. In the NTPS folder, there is a catalog CSV that provides a crosswalk between the folder names and the table titles. The documentation folder contains (1) codebooks for NTPS generated in NCES DataLab, (2) questionnaires for NTPS downloaded from the study website, and (3) reports related to NTPS found in the NCES resource library.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.
Activities:
- Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
- Bulk data transfer: applications that transfer large data volumes over the network. Examples include SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox, or the university repository.
- Web browsing: all the traffic generated while searching and consuming different web pages, such as several blogs, news sites, and the university's Moodle.
- Video playback: traffic from applications that consume video via streaming or pseudo-streaming. The best-known services used are Twitch and YouTube, but the university's online classroom has also been used.
- Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube, and several web pages, but always without user interaction.
The capture is performed on a network probe attached, via a SPAN port, to the router that forwards the user's network traffic. The traffic is stored in pcap format with the full packet payload. In the CSV files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the CSV files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every CSV file.
The amount of data is stated as follows:
- Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
- Video: 23 traces, 4496 s, 1405 MBytes
- Web: 23 traces, 4203 s, 148 MBytes
- Interactive: 42 traces, 8934 s, 30.5 MBytes
- Idle: 52 traces, 6341 s, 0.69 MBytes
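As a starting point for working with the traces, the CSV files can be loaded with pandas as sketched below; the exact header spellings and the trace filename are assumptions and should be adjusted to match the files in the dataset.

```python
# Minimal sketch for loading one labelled trace and the mapping file.
# Header spellings and the trace filename are assumptions; adjust them to
# match the actual CSV headers and filenames in the dataset.
import pandas as pd

mapping = pd.read_csv("mapping.csv")       # host IP, csv/pcap filename, activity label
trace = pd.read_csv("video_trace_01.csv")  # hypothetical filename

print(mapping.head())
print(trace.columns.tolist())              # e.g. timestamp, protocol, payload size, ...

# Total payload bytes per protocol for this trace (column names assumed)
print(trace.groupby("protocol")["payload_size"].sum())
```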
The entire dataset for Professional Services Contracts by fiscal quarter is available for download in a Microsoft Excel®-compatible CSV format below. To download: click on the dataset URL, which will open in a new browser tab as a text file; right-click on the text and choose 'Save as' to save it as a CSV. The City of Philadelphia, through a joint effort among the Finance Department, the Chief Integrity Officer, and the Office of Innovation and Technology (OIT), launched an online, open data website for City contracts. The site provides data on the City's professional services contracts in a searchable format and includes a breakdown of contract dollars by vendor, department, and service type. It also features a "frequently asked questions" section to help users understand the available data: http://cityofphiladelphia.github.io/contracts/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. Its purpose is to store the datasets that were used in some of the studies that served as research material for the thesis, as well as the datasets used in the experimental part of this work.
The datasets are specified below, along with details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for more than 1,000 Amazon products, as listed on the official website of Amazon. The data was scraped in January 2023 from the official Amazon website.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data was collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a TXT file and converted into a CSV file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
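Since the positive and negative rows are stored in two contiguous blocks, a shuffle such as the following pandas sketch should be applied before any train/test split; the 80/20 split ratio is an assumption, while the filename and column names are those given above.

```python
# Shuffle the Rotten Tomatoes CSV before splitting, since the first 5,331 rows
# are all negative and the last 5,331 are all positive.
import pandas as pd

df = pd.read_csv("data_rt.csv")  # columns: reviews, labels
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

split = int(0.8 * len(df))       # 80/20 split is an arbitrary example
train, test = df.iloc[:split], df.iloc[split:]
print(train["labels"].value_counts())
print(test["labels"].value_counts())
```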
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw), and division (manually added; categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
https://github.com/gregorywduke/iNaturalistScraper/releases/tag/v2.0.0
A desktop application that downloads the list of image URLs contained in an "Export Observations" CSV file from iNaturalist. Written in Python, this scraper uses PySimpleGUI and comes packaged as an executable.
To Use, Follow these Steps:
Visit the "Export Observations" page on iNaturalist.org, and select your desired parameters. Before exporting your observations, make sure you unselect ALL options in (Section 3, Choose Columns) EXCEPT "image_url". That should be the only option enabled. Export your observations and download the .csv provided. Check the .csv you downloaded. Make sure there is no text besides the urls present. If there is a first line that is NOT a URL, delete it. Download iNaturalistScraper v1.0.0 from the "Releases" section, and run the .exe. In the application, select the .csv file you downloaded. Select the folder you want your images to be deposited. Click "Proceed" and let the program download your images. Upwards of 500 images may take 5-10 minutes. More images will take longer, though the process is CPU-dependant. Notes: (PLEASE READ)
Ensure you delete the first line, if it is not a URL. Ensure you ONLY select the "image_url" option in Export Observations, and no other option.
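For users who prefer a scriptable alternative to the packaged executable, the core step the application automates (downloading every image URL listed in the exported CSV) can be approximated in a few lines of Python; this is only a sketch under the assumptions noted in the comments, not the application's own code.

```python
# Rough equivalent of what the scraper automates: download every image URL
# listed in the exported CSV into a target folder. Assumes a single-column
# CSV of URLs with no header row, and that a .jpg extension is acceptable.
import csv
import os
import urllib.request

def download_images(csv_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as f:
        urls = [row[0] for row in csv.reader(f) if row and row[0].startswith("http")]
    for i, url in enumerate(urls):
        urllib.request.urlretrieve(url, os.path.join(out_dir, f"observation_{i:05d}.jpg"))

if __name__ == "__main__":
    download_images("observations-export.csv", "downloaded_images")
```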
License: https://spdx.org/licenses/CC0-1.0.html
Version update: The originally uploaded versions of the CSV files in this dataset included an extra column, "Unnamed: 0," which is not RAMP data and was an artifact of the process used to export the data to CSV format. This column has been removed from the revised dataset. The data are otherwise the same as in the first version.
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2020. For a description of the data collection, processing, and output methods, please see the "methods" section below.
Methods Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page-level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
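RAMP's exact classification logic is not reproduced here, but the idea behind the citableContent flag can be illustrated with a simple extension-based check; the extension list in this sketch is an assumption, not RAMP's actual rule set.

```python
# Illustrative sketch of the citableContent idea: mark URLs ending in a
# non-HTML file extension as "Yes", everything else as "No". RAMP's actual
# rules may differ; the extension list here is an assumption.
import pandas as pd
from urllib.parse import urlparse

CONTENT_EXTENSIONS = (".pdf", ".csv", ".doc", ".docx", ".xls", ".xlsx", ".zip")

def flag_citable(url: str) -> str:
    path = urlparse(url).path.lower()
    return "Yes" if path.endswith(CONTENT_EXTENSIONS) else "No"

pages = pd.read_csv("2020-01_RAMP_all_page-clicks.csv")
pages["citableContent_check"] = pages["url"].map(flag_citable)
print(pages["citableContent_check"].value_counts())
```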
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
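In pandas terms, the calculation amounts to the following sketch against a published page-clicks file, using the field names described in this documentation:

```python
# Citable Content Downloads (CCD) for a date range: filter page-click rows to
# citable content and sum their clicks. Field names match the CSV description.
import pandas as pd

df = pd.read_csv("2020-01_RAMP_all_page-clicks.csv", parse_dates=["date"])

start, end = "2020-01-01", "2020-01-31"
in_range = df[(df["date"] >= start) & (df["date"] <= end)]
ccd = in_range.loc[in_range["citableContent"] == "Yes", "clicks"].sum()
print(f"CCD for {start} to {end}: {ccd}")
```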
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.
As a result, two CSV datasets are provided for each month of published data:
page-clicks:
The data in these CSV files correspond to the page-level data, and include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “page-clicks”. For example, the file named 2020-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2020.
country-device-info:
The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
index: The Elasticsearch index corresponding to country and device access information data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “country-device-info”. For example, the file named 2020-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2020.
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
I am preparing a book on change to add to my publications (https://robertolofaro.com/published), and I was looking into speeches delivered by ECB, and the search on the website wasn't what I needed.
I started posting online updates in late 2019; the online web app that allows searching via a tag cloud is currently updated on a weekly basis, each Monday evening.
Search by tag: https://robertolofaro.com/ECBSpeech (links also to dataset on kaggle)
Since 2024-03-25, the dataset also contains AI-based audio transcripts of any ECB item collected, whenever the audio file is accessible.
source: ECB website
In late October/early November 2019, the ECB posted on LinkedIn a link to a CSV dataset covering 1997 up to 2019-10-25 with all the speeches delivered, as per their website.
The dataset was "flat", and I needed both to search quickly for associations of people to concepts and to see the relevant speech directly in a human-readable format (as some speeches had pictures, tables, attachments, etc.).
So, I recycled a concept that I had developed for other purposes and used in an experimental "search by tag cloud on structured content" on https://robertolofaro.com/BFM2013tag
The result is https://robertolofaro.com/ECBSpeech, that contains information from the CSV file (see website for the link to the source), with the additional information as shown within the "About this file".
The concept behind sharing this dataset on Kaggle, and releasing on my public website the application I use to navigate the data (I have a local XAMPP installation where I use this and other applications to support the research side of my past business and current publication activities), is described at http://robertolofaro.com/datademocracy
This tag cloud contains the most common words 1997-2020 across the dataset
Thanks to the ECB for saving my time (I was going to copy-and-paste or "scrape" with R from the speeches posted on their website) by releasing the dataset https://www.ecb.europa.eu/press/key/html/downloads.en.html
In my cultural and organizational change activities, and within data collection, collation, and processing to support management decision-making (including my own) since the 1980s, I have always seen that the more data we collect, the less time there is to retrieve it when needed.
I usually worked across multiple environments, industries, cultures, and "collecting" was never good enough if I could not then "retrieve by association".
In storytelling it is fine just to roughly remember "cameos from the past", but in data storytelling (or when trying to implement a new organization, process, or even just software or data analysis) being able to pinpoint a source that might have been there before is equally important.
So, I am simply exploring different ways to cross-reference information from different domains, as I am quite confident that within all the open data (including the ECB speeches) there are the results of what niche experts saw on various items.
Therefore, why should time and resources be wasted on redoing what was already done by others, when you can start from their endpoint, adapting first and adopting then (if relevant)?
2020-01-25: added a GitHub repository for versioning and release of additional material, as the upload of the new export_datamart.csv wasn't possible; it is now available at https://github.com/robertolofaro/ecbspeech
Changes in the dataset:
1. fixed language codes
2. added speeches published on the ECB website in January 2020 (up to 2020-01-25 09:00 CET)
3. added all the items listed under the "interview" section of the ECB website
current content: 340 interviews, 2374 speeches
2020-01-29: the same file released on GitHub on 2020-01-25, containing both speeches and interviews, with an additional column to differentiate between the two, is now available on Kaggle
current content: 340 interviews, 2374 speeches
2020-02-26: monthly update, with items released on the ECB website up to 2020-02-22
current content: 2731 items, 345 interviews, 2386 speeches
2020-03-25: monthly update, with items released on the ECB website up to 2020-03-20
Since March 2020, the dataset also includes the press conferences available on the ECB website.
current content: 2988 records (2392 speeches, 351 interviews, 245 press conferences)
2020-06-07: update, with items released on the ECB website up to 2020-06-07
Since June 2020, the dataset also includes press conferences, blog posts, and podcasts available on the ECB website.
current content: 3030 records (2399 speeches, 369 interviews, 247 press conferences, 8 blog posts, 7 ECB Podcast). ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Includes Vaccine Adverse Event Reporting System (VAERS) Data Files from 1990 to January 2025. A reproducible Rmd file, website screenshots, and documentation have been included in the Supplementary Information folder.
From the VAERS Data Sets website: VAERS data CSV and compressed (ZIP) files are available for download in the table below. For information about VAERS data, please view the VAERS Data Use Guide [PDF - 310KB], which contains the following information:
- Important information about VAERS from the FDA
- Brief description of VAERS
- Cautions on interpreting VAERS data
- Definitions of terms
- Description of files
- List of commonly used abbreviations
Select the desired time interval to download VAERS data. Each data set is available for download as a compressed (ZIP) file or as individual CSV files. Each compressed file contains the three CSV files listed for a specific data set.
Last updated: February 7, 2025.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.
Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.
Fork this kernel to get started with this dataset.
Dataset Source: https://archive.org/download/stackexchange
https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow
https://cloud.google.com/bigquery/public-data/stackoverflow
Banner Photo by Caspar Rubin from Unsplash.
What is the percentage of questions that have been answered over the years?
What is the reputation and badge count of users across different tenures on StackOverflow?
What are 10 of the “easier” gold badges to earn?
Which day of the week has most questions answered within an hour?
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The /kaggle/input/online-review-csv/online_review.csv file contains customer reviews from Flipkart. It includes the following columns:
- review_id: Unique identifier for each review.
- product_id: Unique identifier for each product.
- user_id: Unique identifier for each user.
- rating: Star rating (1 to 5) given by the user.
- title: Summary of the review.
- review_text: Detailed feedback from the user.
- review_date: Date the review was submitted.
- verified_purchase: Indicates if the purchase was verified (true/false).
- helpful_votes: Number of users who found the review helpful.
- reviewer_name: Name or alias of the reviewer.
Uses:
- Sentiment Analysis: Understand customer sentiments.
- Product Improvement: Identify areas for product enhancement.
- Market Research: Analyze customer preferences.
- Recommendation Systems: Improve recommendation algorithms.
This dataset is ideal for practicing data analysis and machine learning techniques.
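As a quick start for the sentiment-analysis use case, the file can be loaded and given a coarse sentiment label derived from the star rating; the thresholds in this sketch are an assumption, not part of the dataset.

```python
# Quick-start sketch: derive a coarse sentiment label from the star rating.
# The 1-2 / 3 / 4-5 thresholds are an assumption, not part of the dataset.
import pandas as pd

df = pd.read_csv("online_review.csv")

def rating_to_sentiment(rating: int) -> str:
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"

df["sentiment"] = df["rating"].apply(rating_to_sentiment)
print(df["sentiment"].value_counts())
```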
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of this document is to accompany the public release of data collected from OpenCon 2015 applications.
Download & Technical Information
The data can be downloaded in CSV format from GitHub here: https://github.com/RightToResearch/OpenCon-2015-Application-Data. The file uses UTF-8 encoding, comma as the field delimiter, quotation marks as the text delimiter, and no byte order mark.
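Given the stated format (UTF-8, comma-delimited, quoted text fields, no byte order mark), the file reads cleanly with default CSV settings; the sketch below uses a placeholder filename and an assumed column header.

```python
# Reading the OpenCon 2015 application data: UTF-8, comma-delimited,
# double-quoted text fields, no byte order mark. The filename is a placeholder
# for the CSV in the GitHub repository linked above.
import pandas as pd

df = pd.read_csv(
    "opencon-2015-applications.csv",
    encoding="utf-8",
    sep=",",
    quotechar='"',
)
print(df.shape)
print(df["Country of Nationality"].value_counts().head(10))  # column name assumed
```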
This data is released to the public for free and open use under a CC0 1.0 license. We have a couple of requests for anyone who uses the data. First, we’d love it if you would let us know what you are doing with it, and share back anything you develop with the OpenCon community (#opencon / @open_con ). Second, it would also be great if you would include a link to the OpenCon 2015 website (www.opencon2015.org) wherever the data is used. You are not obligated to do any of this, but we’d appreciate it!
Unique ID
This is a unique ID assigned to each applicant. Numbers were assigned using a random number generator.
Timestamp
This was the timestamp recorded by Google Forms. Timestamps are in EDT (Eastern U.S. Daylight Time). Note that the application process officially began at 1:00pm EDT on June 1 and ended at 6:00am EDT on June 23. Some applications have timestamps later than this date due to a variety of reasons, including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form.
Gender
Mandatory. Choose one from list or fill-in other. Options provided: Male, Female, Other (fill in).
Country of Nationality
Mandatory. Choose one option from list.
Country of Residence
Mandatory. Choose one option from list.
What is your primary occupation?
Mandatory. Choose one from list or fill-in other. Options provided: Undergraduate student; Masters/professional student; PhD candidate; Faculty/teacher; Researcher (non-faculty); Librarian; Publisher; Professional advocate; Civil servant / government employee; Journalist; Doctor / medical professional; Lawyer; Other (fill in).
Select the option below that best describes your field of study or expertise
Mandatory. Choose one option from list.
What is your primary area of interest within OpenCon’s program areas?
Mandatory. Choose one option from list. Note: for the first approximately 24 hours the options were listed in this order: Open Access, Open Education, Open Data. After that point, we set the form to randomize the order, and noticed an immediate shift in the distribution of responses.
Are you currently engaged in activities to advance Open Access, Open Education, and/or Open Data?
Mandatory. Choose one option from list.
Are you planning to participate in any of the following events this year?
Optional. Choose all that apply from list. Multiple selections separated by semi-colon.
Do you have any of the following skills or interests?
Mandatory. Choose all that apply from list or fill-in other. Multiple selections separated by semi-colon. Options provided: Coding; Website Management / Design; Graphic Design; Video Editing; Community / Grassroots Organizing; Social Media Campaigns; Fundraising; Communications and Media; Blogging; Advocacy and Policy; Event Logistics; Volunteer Management; Research about OpenCon's Issue Areas; Other (fill-in).
This data consists of information collected from people who applied to attend OpenCon 2015. In the application form, questions that would be released as Open Data were marked with a caret (^) and applicants were asked to acknowledge before submitting the form that they understood that their responses to these questions would be released as such. The questions we released were selected to avoid any potentially sensitive personal information, and to minimize the chances that any individual applicant can be positively identified. Applications were formally collected during a 22 day period beginning on June 1, 2015 at 13:00 EDT and ending on June 23 at 06:00 EDT. Some applications have timestamps later than this date, and this is due to a variety of reasons including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form. Applications were collected using a Google Form embedded at http://www.opencon2015.org/attend, and the shortened bit.ly link http://bit.ly/AppsAreOpen was promoted through social media. The primary work we did to clean the data focused on identifying and eliminating duplicates. We removed all duplicate applications that had matching e-mail addresses and first and last names. We also identified a handful of other duplicates that used different e-mail addresses but were otherwise identical. In cases where duplicate applications contained any different information, we kept the information from the version with the most recent timestamp. We made a few minor adjustments in the country field for cases where the entry was obviously an error (for example, electing a country listed alphabetically above or below the one indicated elsewhere in the application). We also removed one potentially offensive comment (which did not contain an answer to the question) from the Gender field and replaced it with “Other.”
OpenCon 2015 is the student and early career academic professional conference on Open Access, Open Education, and Open Data and will be held on November 14-16, 2015 in Brussels, Belgium. It is organized by the Right to Research Coalition, SPARC (The Scholarly Publishing and Academic Resources Coalition), and an Organizing Committee of students and early career researchers from around the world. The meeting will convene students and early career academic professionals from around the world and serve as a powerful catalyst for projects led by the next generation to advance OpenCon's three focus areas—Open Access, Open Education, and Open Data. A unique aspect of OpenCon is that attendance at the conference is by application only, and the majority of participants who apply are awarded travel scholarships to attend. This model creates a unique conference environment where the most dedicated and impactful advocates can attend, regardless of where in the world they live or their access to travel funding. The purpose of the application process is to conduct these selections fairly. This year we were overwhelmed by the quantity and quality of applications received, and we hope that by sharing this data, we can better understand the OpenCon community and the state of student and early career participation in the Open Access, Open Education, and Open Data movements.
For inquires about the OpenCon 2015 Application data, please contact Nicole Allen at nicole@sparc.arl.org.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you should have a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes the 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.
The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for doing the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results will be shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save results as CSV files, open each time series by clicking on the button located at the top right corner of each graph. On the new web page, you can click the Download CSV button at the top.
Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly
Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files, stored for example in a folder called output/from_GEE, into a single CSV file named merged.csv. The notebook then applies some preprocessing steps, and the final output is NDSI_FSCA_MODIS_C6.csv.
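The merge step performed by the notebook can be reproduced with a few lines of pandas; the sketch below is an approximation under the assumption that all downloaded CSV files share the same header, and is not the notebook itself.

```python
# Sketch of the merge step: concatenate all CSV files downloaded from Google
# Earth Engine (e.g. in output/from_GEE) into a single merged.csv. This is an
# approximation of merge_downloaded_csv_files.ipynb, not the notebook itself.
import glob
import pandas as pd

files = sorted(glob.glob("output/from_GEE/*.csv"))
merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
merged.to_csv("merged.csv", index=False)
print(f"Merged {len(files)} files into merged.csv with {len(merged)} rows")
```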