Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes several .zip archives that expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are described below.
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
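For readers unfamiliar with the iterparse approach, here is a minimal sketch of how such a script can stream a large XML file into a CSV without loading the whole document into memory. The file, tag, and field names below are placeholders, not CIPO's actual schema; consult the repository's scripts for the real element names.

```python
import csv
import xml.etree.ElementTree as ET

# Stream a large XML file element by element and write selected fields
# to CSV. "Application", "ApplicationNumber", and "FilingDate" are
# hypothetical tag names used for illustration only.
with open("applications.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["application_number", "filing_date"])
    for event, elem in ET.iterparse("XML_raw/sample.xml", events=("end",)):
        if elem.tag == "Application":
            writer.writerow([
                elem.findtext("ApplicationNumber"),
                elem.findtext("FilingDate"),
            ])
            elem.clear()  # discard processed elements to keep memory flat
```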
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Build and customise datasets to match your target audience profile, from a database of 200 million global contacts generated in real-time. Get business contact information that's verified by Leadbook's proprietary A.I. powered data technology.
Our industry data enables you to reach the right prospects and maximize your sales and revenue. Our data covers several industries, providing result-oriented records to help you build and grow your business. Our industry-wise data is a vast repository of verified and opt-in contacts.
Use our executives and professionals contact data to connect with prospects and effectively market B2B products and services. All of our email addresses come with a 97% deliverability or better guarantee.
Simply specify location, industry, employee headcount, job function and/or seniority attributes, then the platform will verify in real-time their business contact information, and you can download the records in a CSV file.
All records include:
- Contact name
- Job title
- Contact email address
- Contact location
- Contact LinkedIn URL
- Organisation name
- Organisation website
- Organisation type
- Organisation headcount
- Primary industry
Additional information like organization phone numbers, organization address, business registration number and secondary industries may be provided where available.
Price starts from USD 0.40 per contact rent & USD 0.80 per contact purchase. Bulk discounts apply.
The train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns; it does not fit into a Kaggle Notebook's RAM using the default pandas read_csv, prompting a search for alternative approaches and file formats.
This dataset provides the train data of the Riiid competition in different formats.
Reading the .csv file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .csv into different file formats so that they can be loaded easily into a Kaggle kernel.
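As a rough illustration of that conversion, the snippet below downcasts column dtypes at read time and saves the frame in binary formats that load much faster than CSV. The column names and dtypes are examples rather than the actual Riiid schema, and to_parquet/to_feather require pyarrow to be installed.

```python
import pandas as pd

# Read with explicit narrow dtypes to cut memory, then persist in a
# binary format. Column names/dtypes are examples, not the real schema.
dtypes = {"user_id": "int32", "content_id": "int16", "answered_correctly": "int8"}
train = pd.read_csv("train.csv", dtype=dtypes)
train.to_parquet("train.parquet")   # fast columnar format (needs pyarrow)
train.to_feather("train.feather")   # another fast binary option
```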
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, a longitudinal dataset, open under specific permission, that provides not only continuous glucose levels but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (for female), M (for male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses recorded for the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
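To illustrate how these variables relate, the following is a minimal pandas sketch, assuming Patient_info.csv has been extracted from T1DiabetesGranada.zip into the working directory. It compares Number_of_days_with_measures against the length of each patient's measurement window.

```python
import pandas as pd

patients = pd.read_csv("Patient_info.csv")
# Length of the measurement window, in days (inclusive of both endpoints)
start = pd.to_datetime(patients["Initial_measurement_date"])
end = pd.to_datetime(patients["Final_measurement_date"])
span_days = (end - start).dt.days + 1
# Fraction of days in the window that actually have measurements
coverage = patients["Number_of_days_with_measures"] / span_days
print(coverage.describe())
```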
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date on which the laboratory received the sample used to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long descriptions (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc. (Alameda, CA, USA), the manufacturer, has conducted validation studies of these devices, concluding that the measurements made by their sensors are comparable to those of YSI analyzer devices (Xylem Inc.), the gold standard, with results falling within zones A and B of the consensus error grid 99.9% of the time. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient were continuous (i.e., a sample at least every 15 minutes) in the Glucose_measurements.csv file, as they should be.
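A rough sketch of that continuity check, assuming Glucose_measurements.csv is in the working directory (this is an illustration, not the authors' validation code): combine the date and time columns into a timestamp, then look at the gaps between consecutive measurements per patient.

```python
import pandas as pd

glucose = pd.read_csv("Glucose_measurements.csv")
glucose["timestamp"] = pd.to_datetime(
    glucose["Measurement_date"] + " " + glucose["Measurement_time"]
)
# Gap between consecutive measurements for each patient
gaps = (
    glucose.sort_values(["Patient_ID", "timestamp"])
    .groupby("Patient_ID")["timestamp"]
    .diff()
)
# Share of intervals at or under the nominal 15-minute sampling period
print((gaps <= pd.Timedelta(minutes=15)).mean())
```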
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are comma-delimited CSV files, available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help users better understand the dataset, with graphics and statistics, is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it provides useful functions such as calculating a patient's age, removing a list of patients from a dataset file, and keeping only a given list of patients in a dataset file.
Code Availability
The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation, transformation, and variable extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information or duplicated rows are removed, and the variable with the timestamp is split into two new variables: measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with data on the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.
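As a rough illustration of the kind of aggregation involved (not the notebook's actual code), two of the Patient_info.csv variables can be recomputed from the measurements file like this:

```python
import pandas as pd

glucose = pd.read_csv("Glucose_measurements.csv")
# Recompute two Patient_info.csv variables from the raw measurements
per_patient = glucose.groupby("Patient_ID").agg(
    Number_of_days_with_measures=("Measurement_date", "nunique"),
    Number_of_measurements=("Measurement_date", "size"),
)
print(per_patient.head())
```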
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
CompanyData.com, powered by BoldData, provides verified company information sourced directly from official trade registers. Our Vietnam database features 1,828,945 company records, offering a reliable and up-to-date foundation for your business needs.
Each Vietnamese company profile includes detailed firmographic data such as company name, registration number, legal form, industry classification, revenue, and employee count. Many records also contain contact details like emails and mobile numbers of decision-makers, helping you connect directly with the right businesses.
Our Vietnam data is trusted for a wide range of applications including compliance, KYC verification, lead generation, market research, sales and marketing campaigns, CRM enrichment, and AI training. Every record is curated for accuracy and relevance, ensuring your strategies are built on reliable information.
Choose the delivery method that suits your business best. We offer tailored company lists, complete national databases, real-time API access, and ready-to-use Excel or CSV files. Our enrichment services further enhance your existing data with fresh, verified information.
With access to more than 380 million verified companies worldwide, CompanyData.com helps businesses grow locally in Vietnam and scale globally with confidence. Let us power your data-driven decisions with precision, quality, and reach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
You can also access an API version of this dataset.
TMS (traffic monitoring system) daily-updated traffic counts API
Important note: due to the size of this dataset, you won't be able to open it fully in Excel. Use notepad / R / any software package which can open more than a million rows.
Data reuse caveats: as per license.
Data quality statement: please read the accompanying user manual, explaining:
- how this data is collected
- identification of count stations
- traffic monitoring technology
- monitoring hierarchy and conventions
- typical survey specification
- data calculation
- TMS operation
Traffic monitoring for state highways: user manual [PDF 465 KB]
The data is at daily granularity. However, the actual update frequency of the data depends on the contract the site falls within. For telemetry sites it's once a week, on a Wednesday. Some regional sites are fortnightly, and some monthly or quarterly. Some are only counted 4 weeks a year, with timing depending on contractors' programme of work.
Data quality caveats: you must use this data in conjunction with the user manual and the following caveats.
- The road sensors used in data collection are subject to both technical errors and environmental interference.
- Data is compiled from a variety of sources. Accuracy may vary and the data should only be used as a guide.
- As not all road sections are monitored, a direct calculation of Vehicle Kilometres Travelled (VKT) for a region is not possible.
- Data is sourced from Waka Kotahi New Zealand Transport Agency TMS data.
- For sites that use dual loops, classification is by length. Vehicles with a length of less than 5.5m are classed as light vehicles. Vehicles over 11m long are classed as heavy vehicles. Vehicles between 5.5 and 11m are split 50:50 into light and heavy, as sketched in the code after this list.
- In September 2022, the National Telemetry contract was handed to a new contractor. During the handover process, due to some missing documents and aged technology, 40 of the 96 national telemetry traffic count sites went offline. The current contractor has continued to upload data from all active sites and has gradually worked to bring most offline sites back online. Please note and account for possible gaps in data from National Telemetry sites.
The NZTA Vehicle Classification Relationships diagram below shows the length classification (typically dual loops) and axle classification (typically pneumatic tube counts), and how these map to the Monetised benefits and costs manual, table A37, page 254.
Monetised benefits and costs manual [PDF 9 MB]
For the full TMS classification schema, see Appendix A of the traffic counting manual vehicle classification scheme (NZTA 2011), below.
Traffic monitoring for state highways: user manual [PDF 465 KB]
State highway traffic monitoring (map)
State highway traffic monitoring sites
Violations issued by the Department of Buildings in the City of Chicago from 2006 to the present. The dataset contains more than 1 million records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu, then open the file in an ASCII text editor, such as WordPad, to view and search. Violations are always associated with an inspection, and there can be multiple violation records for one (1) inspection record. Data fields requiring description are detailed below.

VIOLATION DATE: The date the violation was cited.

INSPECTION CATEGORY: Inspections are categorized by one of the following:
- COMPLAINT – Inspection is a result of a 311 complaint
- PERIODIC – Inspection is a result of a recurring inspection (typically on an annual cycle)
- PERMIT – Inspection is a result of a permit
- REGISTRATION – Inspection is a result of a registration (typically Vacant Building Registration)

PROPERTY GROUP: Properties (lots) in the City of Chicago can typically have multiple point addresses, range addresses, and buildings. Examples are corner lots, large lots, and lots with front and rear buildings. As a result, inspections (and their associated violations), permits, and complaints related to a single property could have different addresses. This problem can be reconciled by using Property Group: all point and range addresses for a property are assigned the same Property Group key.

Data Owner: Buildings. Time Period: January 1, 2006 to present. Frequency: Data is updated daily. Related Applications: Building Data Warehouse, http://www.cityofchicago.org/city/en/depts/bldgs/provdrs/inspect/svcs/building_violationsonline.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:
54,252,380 issue comments (from 13,458,208 issues)
979,642 commit comments (from 49,710,108 commits)
33,408,215 pull request comments (from 12,680,373 pull requests)
Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.
Purpose
The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.
Collection Process
Full implementation details can be found in the following publication:
Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.
Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:
At least 850 stars.
Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.
Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecoderErrors, etc.). Eight target languages had no repositories matching the above criteria.
After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files as no newlines have been removed from developer comments.
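As a minimal sketch of that recommendation (the filename below is an example of the <language>_<type>.csv naming scheme described in the next section, not a guaranteed file name):

```python
import csv

# Raise the field size limit: individual comments can be very long.
csv.field_size_limit(10**7)

# Comments contain embedded newlines, so parse with csv.reader rather
# than splitting the file on line endings.
with open("Python_is.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        pass  # each row is one comment record, newlines preserved
```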
88_million_developer_comments.zip
This zip file contains 135 CSV files; 3 per language. CSV names are formatted <language>_<type>.csv, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests).
Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL.
Details on the columns in each CSV file are described in the provided README.md.
Detailed_Breakdown.ods
This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.
Note On Completeness
We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, sometimes a query failed and that data is not included here.
Versioning
v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.
v1.0: Original corpus.
Contact
Please contact Benjamin S. Meyers (email) with questions about this data and its collection.
Acknowledgments
Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).
This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology. doi:10.34788/0S3G-QD15
https://crawlfeeds.com/privacy_policy
Get access to a comprehensive and structured dataset of BBC News articles, freshly crawled and compiled in February 2023. This collection includes 1 million records from one of the world’s most trusted news organizations — perfect for training NLP models, sentiment analysis, and trend detection across global topics.
💾 Format: CSV (available in ZIP archive)
📢 Status: Published and available for immediate access
Train language models to summarize or categorize news
Detect media bias and compare narrative framing
Conduct research in journalism, politics, and public sentiment
Enrich news aggregation platforms with clean metadata
Analyze content distribution across categories (e.g. health, politics, tech)
This dataset ensures reliable and high-quality information sourced from a globally respected outlet. The format is optimized for quick ingestion into your pipelines — with clean text, timestamps, image links, and more.
Need a filtered dataset or want this refreshed for a later date? We offer on-demand news scraping as well.
👉 Request access or sample now
https://opendata.nhsbsa.net/dataset/foi-01502 September 2023
https://opendata.nhsbsa.net/dataset/foi-01550 October 2023
https://opendata.nhsbsa.net/dataset/foi-01668 November 2023
https://opendata.nhsbsa.net/dataset/foi-01669 December 2023
https://opendata.nhsbsa.net/dataset/foi-01756

Some data sets are over 1 million rows of data, so you may need to use add-ons already existing in Microsoft Excel to view a data set in its entirety. The Microsoft PowerPivot add-on for Excel can be used to handle larger data sets; it is available using the link in the 'Related Links' section below: https://www.microsoft.com/en-us/download/details.aspx?id=43348

Once PowerPivot has been installed, to load the large files, please follow the instructions below:
1. Start Excel as normal
2. Click on the PowerPivot tab
3. Click on the PowerPivot Window icon (top left)
4. In the PowerPivot Window, click on the "From Other Sources" icon
5. In the Table Import Wizard, scroll to the bottom and select Text File
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV
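If Excel is not a hard requirement, a hedged alternative is to stream the CSV in code. The sketch below uses pandas with a placeholder filename standing in for any of the FOI extracts listed above.

```python
import pandas as pd

# Stream a >1M-row CSV in manageable chunks instead of opening it whole.
# "foi-01502.csv" is a placeholder filename for a downloaded extract.
row_count = 0
for chunk in pd.read_csv("foi-01502.csv", chunksize=100_000):
    row_count += len(chunk)
print(row_count)
```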
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General information
This repository includes all data needed to reproduce the experiments presented in [1]. The paper describes the BF skip index, a data structure based on Bloom filters [2] that can be used for answering inter-block queries on blockchains efficiently. The article also includes a historical analysis of the logsBloom filters included in the Ethereum block headers, as well as an experimental analysis of the proposed data structure. The latter was conducted using the data set of events generated by the CryptoKitties Core contract, a popular decentralized application launched in 2017 (and also one of the first applications based on NFTs).
In this description, we use the following abbreviations (also adopted throughout the paper) to denote two different sets of Ethereum blocks.
D1: set of all Ethereum blocks between height 0 and 14999999.
D2: set of all Ethereum blocks between height 14000000 and 14999999.
Moreover, in accordance with the terminology adopted in the paper, we define the set of keys of a block as the set of all contract addresses and log topics of the transactions in the block. As defined in [3], log topics comprise event signature digests and the indexed parameters associated with the event occurrence.
Data set description
| File | Description |
| --- | --- |
| filters_ones_0-14999999.csv.xz | Compressed CSV file containing the number of ones for each logsBloom filter in D1. |
| receipt_stats_0-14999999.csv.xz | Compressed CSV file containing statistics about all transaction receipts in D1. |
| Approval.csv | CSV file containing the Approval event occurrences for the CryptoKitties Core contract in D2. |
| Birth.csv | CSV file containing the Birth event occurrences for the CryptoKitties Core contract in D2. |
| Pregnant.csv | CSV file containing the Pregnant event occurrences for the CryptoKitties Core contract in D2. |
| Transfer.csv | CSV file containing the Transfer event occurrences for the CryptoKitties Core contract in D2. |
| events.xz | Compressed binary file containing information about all contract events in D2. |
| keys.xz | Compressed binary file containing information about all keys in D2. |
File structure
We now describe the structure of the files included in this repository.
filters_ones_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 3 columns. Note that it is not necessary to decompress this file, as the provided code is capable of processing it directly in its compressed form. The columns have the following meaning.
blockId: the identifier of the block.
timestamp: timestamp of the block.
numOnes: number of bits set to 1 in the logsBloom filter of the block.
receipt_stats_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 5 columns. As for the previous file, it is not necessary to decompress this file.
blockId: the identifier of the block.
txCount: number of transactions included in the block.
numLogs: number of event logs included in the block.
numKeys: number of keys included in the block.
numUniqueKeys: number of distinct keys in the block (useful as the same key may appear multiple times).
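As noted above, these compressed CSVs can be processed directly. For example, pandas infers the xz codec from the file extension, so a minimal sketch looks like this (column names as documented above):

```python
import pandas as pd

# pandas infers the xz codec from the .xz extension and decompresses
# on the fly; no manual extraction is needed.
stats = pd.read_csv("receipt_stats_0-14999999.csv.xz")
print(stats[["txCount", "numLogs"]].describe())
```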
All CSV files related to the CryptoKitties Core events (i.e., Approval.csv, Birth.csv, Pregnant.csv, Transfer.csv) have the same structure. They consist of 1 million rows (one for each block in D2) and 2 columns, namely:
blockId: identifier of the block.
numOcc: number of event occurrences in the block.
events.xz is a compressed binary file describing all unique event occurrences in the blocks of D2. The file contains 1 million data chunks (i.e., one for each Ethereum block). Each chunk includes the following information. Do note that this file only records unique event occurrences in each block, meaning that if an event from a contract is triggered more than once within the same block, there will be only one sequence within the corresponding chunk.
blockId: identifier of the block (4 bytes).
numEvents: number of event occurrences in the block (4 bytes).
A list of numEvents sequences, each made up of 52 bytes. A sequence represents an event occurrence and is the concatenation of two fields, namely:
Address of the contract triggering the event (20 bytes).
Event signature digest (32 bytes).
keys.xz is a compressed binary file describing all unique keys in the blocks of D2. As for the previous file, duplicate keys only appear once. The file contains 1 million data chunks, each representing an Ethereum block and including the following information.
blockId: identifier of the block (4 bytes)
numAddr: number of unique contract addresses (4 bytes).
numTopics: number of unique topics (4 bytes).
A sequence of numAddr addresses, each represented using 20 bytes.
A sequence of numTopics topics, each represented using 32 bytes.
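A minimal sketch of a reader for this chunk layout, assuming fixed-width little-endian 4-byte integers (the byte order is an assumption; the repository's own code is authoritative):

```python
import lzma
import struct

def iter_key_chunks(path="keys.xz"):
    """Yield (block_id, addresses, topics) per block from keys.xz.

    Assumes little-endian 4-byte integers; verify against the
    repository's own reader before relying on this.
    """
    with lzma.open(path, "rb") as f:
        while True:
            header = f.read(12)  # blockId, numAddr, numTopics
            if len(header) < 12:
                break
            block_id, num_addr, num_topics = struct.unpack("<III", header)
            addresses = [f.read(20) for _ in range(num_addr)]
            topics = [f.read(32) for _ in range(num_topics)]
            yield block_id, addresses, topics
```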
Notes
For space reasons, some of the files in this repository have been compressed using the XZ compression utility. Unless otherwise specified, these files need to be decompressed before they can be read. Please make sure you have an application installed on your system that is capable of decompressing such files.
Cite this work
If the data included in this repository have been useful, please cite the following article in your work.
@article{loporchio2025skip,
  title     = {Skip index: Supporting efficient inter-block queries and query authentication on the blockchain},
  author    = {Loporchio, Matteo and Bernasconi, Anna and Di Francesco Maesa, Damiano and Ricci, Laura},
  journal   = {Future Generation Computer Systems},
  volume    = {164},
  pages     = {107556},
  year      = {2025},
  publisher = {Elsevier}
}
References
Loporchio, Matteo et al. "Skip index: supporting efficient inter-block queries and query authentication on the blockchain". Future Generation Computer Systems 164 (2025): 107556. https://doi.org/10.1016/j.future.2024.107556
Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13.7 (1970): 422-426.
Wood, Gavin. "Ethereum: A secure decentralised generalised transaction ledger." Ethereum project yellow paper 151.2014 (2014): 1-32.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 250 million rows of information from the ~500 bike stations of the Barcelona public bicycle sharing service. The data consists of time-series information on the electric and mechanical bicycles available approximately every 4 minutes, from March 2019 to March 2024 (the latest available CSV file, with the idea of being updated with every new month's file). This data could inspire many different use cases, from geographical data analysis to hierarchical ML time series models or Graph Neural Networks, among others. Feel free to create a New Notebook from this page to use it and share your ideas with everyone!
[Image: map of the Bicing stations, 2024]
Every month's information is separated into a different file named {year}_{month}_STATIONS.csv. The metadata of every station has been simplified and compressed into the {year}_INFO.csv files, where there is a single entry for every station and day, separated into a different file for every year.

The original data has some errors; a few of them have already been corrected, but there are still some missing values, columns with wrong data types, and other minor artifacts or missing data. From time to time I may manually correct more of those.

The data is collected from the public BCN Open Data website, which is available to everyone (some resources require creating a free account and token):
- Stations data: https://opendata-ajuntament.barcelona.cat/data/en/dataset/estat-estacions-bicing
- Stations info: https://opendata-ajuntament.barcelona.cat/data/en/dataset/informacio-estacions-bicing

You can find more information there.
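As a starting point, here is a hedged sketch of stacking one year of monthly station files, assuming the CSVs sit in the working directory and share a common schema:

```python
import glob
import pandas as pd

# Concatenate all monthly files for one year, following the
# {year}_{month}_STATIONS.csv naming pattern described above.
files = sorted(glob.glob("2023_*_STATIONS.csv"))
stations = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(stations.shape)
```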
Please, consider upvoting this dataset if you find it interesting! 🤗
Some observations:
The historical data for June '19 does not have data for the 20th between 7:40 am and 2:00 pm.
The historical data for July '19 does not have data from the 26th at 1:30 pm until the 29th at 10:40 am.
The historical data for November '19 may not have some data from 10:00 pm on the 26th to 11:00 am on the 27th.
The historical data for August '20 does not have data from the 7th at 2:25 am until the 10th at 10:40 am.
The historical data for November '20 does not have data on the following days/times: the 4th from 1:45 am to 11:05 am; the 20th from 7:50 pm to the 21st at 10:50 am; and the 27th from 2:50 am to the 30th at 9:50 am.
The historical data for August '23 does not have data from the 22nd to the 31st due to a technical incident.
The historical data for September '23 does not have data from the 1st to the 5th due to a technical incident.
The historical data for February '24 does not have data on the 5th between 12:50 pm and 1:05 pm.
Others: Due to COVID-19 measures, the Bicing service was temporarily stopped, reflecting this situation in the historical data.
Field Description:

Array of data for each station:
- station_id: Identifier of the station
- num_bikes_available: Number of available bikes
- num_bikes_available_types: Array of types of available bikes
  - mechanical: Number of available mechanical bikes
  - ebike: Number of available electric bikes
- num_docks_available: Number of available docks
- is_installed: The station is properly installed (0-NO, 1-YES)
- is_renting: The station is providing bikes correctly
- is_returning: The station is docking bikes correctly
- last_reported: Timestamp of the station information
- is_charging_station: The station has electric bike charging capacity
- status: Status of the station (IN_SERVICE=In service, CLOSED=Closed)
CompanyData.com powered by BoldData is your gateway to verified, high-quality business data from around the world. We specialize in delivering structured company information sourced directly from official trade registers, giving you reliable data to fuel smarter business decisions.
Our USA company database includes over 69,853,300 verified business records, making it one of the most comprehensive sources of company information available. Each record contains detailed firmographics such as industry classification, company size and revenue, corporate hierarchies and verified contact details including decision-maker names, email addresses, direct dials and mobile numbers.
This rich dataset supports a wide range of use cases, including:
- Regulatory compliance and KYC verification
- Sales prospecting and lead generation
- B2B marketing and audience segmentation
- CRM enrichment and data cleansing
- Training data for AI and machine learning models
We offer flexible delivery options tailored to your workflow:
- Tailored company lists filtered by location, size, industry and more
- Full USA company database exports in Excel or CSV
- Real-time API access for seamless data integration
- Data enrichment services to enhance your internal records
The United States is a key part of our global database of verified companies spanning more than 200 countries. Whether you are expanding into the US market or enriching global CRM systems, we deliver the accuracy, scale and flexibility your business demands.
Partner with CompanyData.com to unlock actionable company intelligence in the USA delivered how you need it, when you need it, with the precision your business deserves.
The FDA Device Dataset by Dataplex provides comprehensive access to over 24 million rows of detailed information, covering 9 key data types essential for anyone involved in the medical device industry. Sourced directly from the U.S. Food and Drug Administration (FDA), this dataset is a critical resource for regulatory compliance, market analysis, and product safety assessment.
Dataset Overview:
This dataset includes data on medical device registrations, approvals, recalls, and adverse events, among other crucial aspects. The dataset is meticulously cleaned and structured to ensure that it meets the needs of researchers, regulatory professionals, and market analysts.
24 Million Rows of Data:
With over 24 million rows, this dataset offers an extensive view of the regulatory landscape for medical devices. It includes data types such as classification, event, enforcement, 510(k), registration listings, recall, PMA, UDI, and COVID-19 serology. This wide range of data types allows users to perform granular analysis on a broad spectrum of device-related topics.
Sourced from the FDA:
All data in this dataset is sourced directly from the FDA, ensuring that it is accurate, up-to-date, and reliable. Regular updates ensure that the dataset remains current, reflecting the latest in device approvals, clearances, and safety reports.
Key Features:
Comprehensive Coverage: Includes 9 key device data types, such as 510(k) clearances, premarket approvals, device classifications, and adverse event reports.
Regulatory Compliance: Provides detailed information necessary for tracking compliance with FDA regulations, including device recalls and enforcement actions.
Market Analysis: Analysts can utilize the dataset to assess market trends, monitor competitor activities, and track the introduction of new devices.
Product Safety Analysis: Researchers can analyze adverse event reports and device recalls to evaluate the safety and performance of medical devices.
Use Cases:
- Regulatory Compliance: Ensure your devices meet FDA standards, monitor compliance trends, and stay informed about regulatory changes.
Market Research: Identify trends in the medical device market, track new device approvals, and analyze competitive landscapes with up-to-date and historical data.
Product Safety: Assess the safety and performance of medical devices by examining detailed adverse event reports and recall data.
Data Quality and Reliability:
The FDA Device Dataset prioritizes data quality and reliability. Each record is meticulously sourced from the FDA's official databases, ensuring that the information is both accurate and up-to-date. This makes the dataset a trusted resource for critical applications, where data accuracy is vital.
Integration and Usability:
The dataset is provided in CSV format, making it compatible with most data analysis tools and platforms. Users can easily import, analyze, and utilize the data for various applications, from regulatory reporting to market analysis.
User-Friendly Structure and Metadata:
The data is organized for easy navigation, with clear metadata files included to help users identify relevant records. The dataset is structured by device type, approval and clearance processes, and adverse event reports, allowing for efficient data retrieval and analysis.
Ideal For:
Regulatory Professionals: Monitor FDA compliance, track regulatory changes, and prepare for audits with comprehensive and up-to-date product data.
Market Analysts: Conduct detailed research on market trends, assess new device entries, and analyze competitive dynamics with extensive FDA data.
Healthcare Researchers: Evaluate the safety and efficacy of medical devices, identify potential risks, and contribute to improved patient outcomes through detailed analysis.
This dataset is an indispensable resource for anyone involved in the medical device industry, providing the data and insights necessary to drive informed decisions and ensure compliance with FDA regulations.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data file for the first release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 10,006,058 data citation records in JSON and CSV formats. The JSON file is the version of record.
Version 1.0 of the corpus data file was released on January 30, 2024. Release v1.1 is an optimized version of v1.0 designed to make the original citation records more usable. No citations have been added to or removed from the dataset in v1.1.
For convenience, the data file is provided in batches of approximately 1 million records each. The publication date and batch number are included in each component file name, ex: 2024-05-10-data-citation-corpus-01-v1.1.json.
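A minimal sketch of loading those batches in Python, assuming each batch file holds a single JSON array of citation records (adjust if the batches turn out to be newline-delimited):

```python
import glob
import json

records = []
# Batch file names follow the pattern shown above: date, batch number, version
for path in sorted(glob.glob("2024-05-10-data-citation-corpus-*-v1.1.json")):
    with open(path, encoding="utf-8") as f:
        records.extend(json.load(f))  # assumes one JSON array per file
print(len(records))
```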
The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions to datasets in the full text of articles.
Each data citation record comprises:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication object
The data file includes the following fields:
| Field | Description | Required? |
| --- | --- | --- |
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| objId | DOI of article where data is cited | Yes |
| subjId | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| accessionNumber | Accession number of cited data | No |
| doi | DOI of cited data | No |
| relationTypeId | Relation type in metadata between citation object and subject | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
Feedback on the data file can be submitted via Github. For general questions, email info@makedatacount.org.
https://crawlfeeds.com/privacy_policy
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.
This large dataset is ideal for large-scale analysis of media narratives, public discourse, and political trends.
Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.
The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
https://creativecommons.org/publicdomain/zero/1.0/
The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper:
"A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources:
- Kaggle Datasets: MNADv1
- Huggingface Datasets: MNADv1
Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing newlines with spaces, excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
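As a rough illustration (not the authors' actual pipeline), similar cleaning steps can be reproduced in pandas; the length thresholds below are assumptions, not the values used for MNAD.v2:

```python
import pandas as pd

df = pd.read_csv("MNADv2.csv")
df = df.drop_duplicates()
df = df.dropna(subset=["Title", "Body", "Category"])
# Collapse runs of whitespace (including newlines) into single spaces
df["Body"] = df["Body"].str.replace(r"\s+", " ", regex=True)
# Drop very short and very long articles; thresholds are assumptions
lengths = df["Body"].str.len()
df = df[(lengths > 100) & (lengths < 20_000)]
```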
This dataset is available for download from the following sources:
- Kaggle Datasets: MNADv2
- Huggingface Datasets: MNADv2
If you use our data, please cite the following paper:
@inproceedings{MNAD2021,
  author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year      = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi       = {10.1109/dasa53625.2021.9682402},
  url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
Verified company data in Greece from CompanyData.com (BoldData)
CompanyData.com (BoldData) delivers accurate, up-to-date company information sourced directly from official Greek trade registers and reliable local authorities. Whether you're entering the Greek market or enhancing existing operations, our verified data helps you identify, understand, and connect with the right businesses.
Over 1,073,791 verified company records in Greece. Our Greek business database covers more than one million companies, from family-owned enterprises in Thessaloniki to major corporations based in Athens. Every record is verified and updated regularly to ensure data integrity, regulatory compliance, and maximum relevance.
Key data fields include:
• Company names, registration numbers, and legal structures
• Contact information including emails, phone numbers, and mobile numbers
• Executive names and key decision-makers
• Industry classification, turnover, number of employees, and founding year
• Business status, operational activity, and hierarchy insights
Use Greece company data for a wide range of applications. Our verified data supports many essential business functions:
• KYC and anti-money laundering compliance
• B2B sales prospecting and marketing campaigns
• CRM and database enrichment
• Market research and strategic analysis
• Artificial intelligence and machine learning training
Flexible delivery tailored to your needs
Whether you need a one-time list or a continuous data stream, we offer:
• Custom-built B2B lists tailored to your target audience
• Full datasets of all active Greek companies
• Real-time API access for seamless integration
• Delivery in Excel, CSV, or through our easy-to-use platform
Part of a global business data network
Our Greece dataset is part of our worldwide coverage of verified company records across more than 200 countries. With years of experience and a strong commitment to quality, CompanyData.com (BoldData) helps organizations worldwide find the right business data to grow, scale, and stay compliant.
Access trusted company data from Greece and beyond with CompanyData.com.
The Canada Trademarks Dataset
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. The contents of these folders are described below.
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
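For orientation only, here is a minimal sketch of the kind of credential-prompted SFTP download the script performs, using paramiko. The host name and remote path are placeholders, not CIPO's actual endpoints; consult /py/sftp_secure.py for the real connection details:

import getpass
import paramiko

# Placeholder host and path; the real values live in /py/sftp_secure.py.
HOST = "sftp.example.gc.ca"
REMOTE_DIR = "/trademarks/xml"

username = input("IP Horizons SFTP username: ")
password = getpass.getpass("IP Horizons SFTP password: ")
target = input("Local target directory: ")

transport = paramiko.Transport((HOST, 22))
transport.connect(username=username, password=password)
sftp = paramiko.SFTPClient.from_transport(transport)

# Download every archive in the remote directory to the target directory.
for name in sftp.listdir(REMOTE_DIR):
    sftp.get(f"{REMOTE_DIR}/{name}", f"{target}/{name}")

sftp.close()
transport.close()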
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
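The "iterparse" naming suggests streaming XML parsing. As a hedged illustration of that pattern (the record and field tags below are placeholders; the real tags are defined in CIPO's XML data dictionary, a copy of which is included in this repository), such a script walks the XML incrementally and writes one CSV row per record, clearing elements as it goes so memory use stays flat even on very large archives:

import csv
import xml.etree.ElementTree as ET

# Placeholder tag names; substitute the tags from CIPO's data dictionary.
with open("out.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["application_number", "filing_date"])
    for event, elem in ET.iterparse("trademarks.xml", events=("end",)):
        if elem.tag == "TradeMark":  # placeholder record element
            writer.writerow([
                elem.findtext("ApplicationNumber", ""),
                elem.findtext("FilingDate", ""),
            ])
            elem.clear()  # free memory for already-processed records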
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Legal Studies 908 (2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
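The do-file itself is Stata, but the same size-reduction idea can be sketched in Python: pandas writes categorical columns to .dta as value-labeled integers, which mirrors the labeling approach described above. The filename and column name below are placeholders, not part of the actual repository:

import pandas as pd

df = pd.read_csv("applications.csv")  # placeholder filename

# Converting repetitive string columns to categoricals mirrors Stata's
# label-based encoding and shrinks the resulting .dta file.
df["status"] = df["status"].astype("category")  # placeholder column

df.to_stata("applications.dta", write_index=False, version=118)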
The Python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.