Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. The contents of these folders are described below.
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
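For users who want to see the shape of the conversion step before running the repository's scripts, the following is a minimal sketch of the iterparse approach (it is not the author's script): it streams one CIPO XML archive and writes a couple of fields to CSV. The tag and field names are hypothetical placeholders; the actual element names are documented in CIPO's XML data dictionary included in this repository.

```python
# Minimal sketch of the iterparse approach (not the repository's script).
# Tag and field names below are hypothetical placeholders; consult CIPO's
# XML data dictionary for the real element names.
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path):
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["application_number", "filing_date"])  # assumed columns
        # iterparse streams the archive, so multi-gigabyte files fit in memory
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag.endswith("TradeMarkApplication"):  # hypothetical tag
                writer.writerow([
                    elem.findtext("ApplicationNumber"),  # hypothetical field
                    elem.findtext("FilingDate"),         # hypothetical field
                ])
                elem.clear()  # free memory as records are written

xml_to_csv("XML_raw/example.xml", "csv/example.csv")
```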
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
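For users who prefer to stay in Python rather than Stata, a rough equivalent of that conversion step (minus Stata's value labeling) can be sketched with pandas; the file names below are placeholders.

```python
# Rough pandas equivalent of the CSV-to-.dta conversion (the repository itself
# uses /do/CA_TM_csv_cleanup.do, which also applies Stata value labels).
# File names are placeholders.
import pandas as pd

df = pd.read_csv("csv/example.csv", dtype=str)          # keep codes as strings
df.to_stata("dta/example.dta", write_index=False, version=118)
```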
The python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
This dataset lists the public and internal datasets published on the City of Austin Open Data Portal, filtered to the Austin Transportation and Public Works department. Dataset types include stories, charts, datasets, filters, embedded links, and files. This dataset is maintained by the Data and Technology Services division in the department.
https://data.gov.tw/license
This dataset lists the datasets published on the government's open data platform; for each dataset it includes the dataset name, file format, download link, dataset type, dataset description, main field description, dataset provider, update frequency, authorization, authorization explanation URL, billing method, encoding format, dataset provider contact person, contact person's phone number, and remarks.
Sensitive Regulated Data: Permitted and Restricted Uses. Contents: Purpose; Scope and Authority; Standard; Violation of the Standard - Misuse of Information; Definitions; References; Appendix A: Personally Identifiable Information (PII); Appendix B: Security of Personally Owned Devices that Access or Maintain Sensitive Restricted Data; Appendix C: Sensitive Security Information (SSI).
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
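As a hedged illustration of how the blob archive and the metadata CSVs fit together, the sketch below looks up a local license file in the metadata by checksum; the checksum algorithm, CSV file name, and column name are assumptions, so check the included README for the actual layout.

```python
# Sketch: look up a local license file in the dataset's metadata by checksum.
# The checksum algorithm (SHA-1 here), CSV file name, and column name are
# assumptions; the dataset's README documents the real layout.
import csv
import hashlib
from pathlib import Path

def sha1_of(path):
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()

def find_metadata(checksum, metadata_csv="license_blobs.csv"):
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("sha1") == checksum:   # hypothetical column name
                return row
    return None

print(find_metadata(sha1_of("LICENSE")))
```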
Open Government Licence - Canada 2.0, https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2-8 are generated using the Flatterer (external link) utility.
Description of resources:
1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the "package" in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. Datastore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
8. Data Package Entity Relation Diagram displays the title and format for each column in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to the Catalogue, which can be queried with database or analytical software tools.
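Because Resource 1 is a GZip-compressed JSON Lines file with one record per line, it can be streamed without loading everything into memory; a minimal sketch follows. The file name is a placeholder, and the field names assume CKAN's usual package schema.

```python
# Minimal sketch: stream the GZip-compressed JSON Lines catalogue (Resource 1).
# The file name is a placeholder; field names assume CKAN's usual package schema.
import gzip
import json

with gzip.open("catalogue.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # each line holds the full nested metadata of one Dataset/Open Information Record
        print(record.get("title"))
```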
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations.
Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
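One plausible reading of the log-likelihood evaluation idea (not necessarily the authors' exact method) is to score a model's prediction by its Gaussian log-likelihood under the mean and variance of the expert labels, so that predictions far from the consensus, relative to the experts' disagreement, are penalized. A small sketch with hypothetical numbers follows.

```python
# One plausible reading of the evaluation idea (not necessarily the authors'
# exact method): score a predicted phase-transformation location by its
# log-likelihood under a Gaussian fitted to the expert labels.
import math

def gaussian_loglik(prediction, expert_labels):
    n = len(expert_labels)
    mean = sum(expert_labels) / n
    var = sum((x - mean) ** 2 for x in expert_labels) / n or 1e-9  # avoid division by zero
    return -0.5 * (math.log(2 * math.pi * var) + (prediction - mean) ** 2 / var)

experts = [310.0, 315.0, 305.0, 320.0]   # hypothetical expert labels
print(gaussian_loglik(312.0, experts))   # higher = closer to consensus, given its spread
```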
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g. folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g. 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
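Under that layout, the path of a notebook version can be derived directly from its KernelVersions id, as in the sketch below (the file extension depends on the notebook's language, so it is left as a parameter).

```python
# Map a KernelVersions id to its path in the two-level layout described above:
# top-level folder = id // 1,000,000; sub-folder = (id // 1,000) % 1,000.
def kernel_version_path(kernel_version_id: int, extension: str = "ipynb") -> str:
    top = kernel_version_id // 1_000_000
    sub = (kernel_version_id // 1_000) % 1_000
    return f"{top}/{sub}/{kernel_version_id}.{extension}"

print(kernel_version_path(123_456_789))   # -> "123/456/123456789.ipynb"
```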
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions
* Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
* Data representation based on the PROHOW vocabulary: http://w3id.org/prohow#. Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
* Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.
Statistics:
* 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
* 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
* 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
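For a first programmatic look at the RDF dumps, a generic rdflib sketch that lists the most frequent predicates is shown below; the file name and serialization format are assumptions, so adjust them to the dump actually downloaded.

```python
# Generic first look at one of the RDF dumps with rdflib: count predicate usage.
# The file name and serialization format are assumptions.
from collections import Counter
import rdflib

g = rdflib.Graph()
g.parse("9of11_knowhow_wikihow.ttl", format="turtle")  # assumed file and format

predicate_counts = Counter(p for _, p, _ in g)
for predicate, count in predicate_counts.most_common(10):
    print(count, predicate)
```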
Microsoft PowerPivot add-on for Excel can be used to handle larger data sets. The Microsoft PowerPivot add-on for Excel is available using the link in the 'Related Links' section - https://www.microsoft.com/en-us/download/details.aspx?id=43348
Once PowerPivot has been installed, to load the large files, please follow the instructions below:
1. Start Excel as normal
2. Click on the PowerPivot tab
3. Click on the PowerPivot Window icon (top left)
4. In the PowerPivot Window, click on the "From Other Sources" icon
5. In the Table Import Wizard, scroll to the bottom and select Text File
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV
Please read the below notes to ensure correct understanding of the data.
Fewer than 5 Items: Please be aware that I have decided not to release the exact number of items where the total number of items falls below 5, for certain drug/patient combinations. Where suppression has been applied, a * is shown in place of the number of items; please read this as 1-4 items. Suppressions have been applied where items are lower than 5, for items and NIC, and for quantity when quantity and items are both lower than 5, for the following drugs and identified genders as per the sensitive drug list:
- When the BNF Paragraph Code is 60401 (Female Sex Hormones & Their Modulators) and the gender identified on the prescription is Male
- When the BNF Paragraph Code is 60402 (Male Sex Hormones And Antagonists) and the gender identified on the prescription is Female
- When the BNF Paragraph Code is 70201 (Preparations For Vaginal/Vulval Changes) and the gender identified on the prescription is Male
- When the BNF Paragraph Code is 70202 (Vaginal And Vulval Infections) and the gender identified on the prescription is Male
- When the BNF Paragraph Code is 70301 (Combined Hormonal Contraceptives/Systems) and the gender identified on the prescription is Male
- When the BNF Paragraph Code is 70302 (Progestogen-only Contraceptives) and the gender identified on the prescription is Male
- When the BNF Paragraph Code is 80302 (Progestogens) and the gender identified on the prescription is Male
- When the BNF Paragraph Code is 70405 (Drugs For Erectile Dysfunction) and the gender identified on the prescription is Female
- When the BNF Paragraph Code is 70406 (Drugs For Premature Ejaculation) and the gender identified on the prescription is Female
This is because the patients could be identified when combined with other information that may be in the public domain or reasonably available. This information falls under the exemption in section 40, subsections 2 and 3A(a), of the Freedom of Information Act. This is because it would breach the first data protection principle as: a. it is not fair to disclose patients' personal details to the world and is likely to cause damage or distress; b. these details are not of sufficient interest to the public to warrant an intrusion into the privacy of the patients. Please click the below web link to see the exemption in full.
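When loading these files programmatically, the suppression convention above ("*" meaning 1-4 items) has to be handled explicitly; a small pandas sketch follows, in which the file name and the ITEMS column name are guesses to be checked against the actual headers.

```python
# Handle the suppression convention when loading the data: "*" means 1-4 items.
# The file name and the "ITEMS" column name are guesses; check the real headers.
import pandas as pd

df = pd.read_csv("prescribing_extract.csv", dtype={"ITEMS": str})

df["ITEMS_SUPPRESSED"] = df["ITEMS"].eq("*")
df["ITEMS_MIN"] = df["ITEMS"].replace("*", "1").astype(int)   # lower bound of 1-4
df["ITEMS_MAX"] = df["ITEMS"].replace("*", "4").astype(int)   # upper bound of 1-4
```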
TRAINING DATASET: Hands-On Formatting Data Part 1 (Download This File)
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document includes a description of the Tukes-avoindatajulkaisut (Tukes open data publications) file structure, along with instructions on how to open the file content in Microsoft Excel.
City of Tempe Open Data Terms of Use document includes: Terms of Use; Data Rights and Usage; Secondary Use; Right to Limit; Changes; Moderation Notice; Disclaimer of Warranties; Limitations on Liability; No Waiver Rights.
Tempe Open Data Change Management and Data Retention Policies includes: Assess the extent of the change; Prepare updated metadata and data dictionary.
Open Government Licence - Canada 2.0, https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The National Pollutant Release Inventory (NPRI) is Canada's public inventory of pollutant releases (to air, water and land), disposals and transfers for recycling. Each file contains data from 1993 to the latest reporting year. These CSV format datasets are in normalized or 'list' format and are optimized for pivot table analyses. Here is a description of each file:
- The RELEASES file contains all substance release quantities.
- The DISPOSALS file contains all on-site and off-site disposal quantities, including tailings and waste rock (TWR).
- The TRANSFERS file contains all quantities transferred for recycling or treatment prior to disposal.
- The COMMENTS file contains all the comments provided by facilities about substances included in their report.
- The GEO LOCATIONS file contains complete geographic information for all facilities that have reported to the NPRI.
Please consult the following resources to enhance your analysis:
- Guide on using and interpreting NPRI data: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/using-interpreting-data.html
- Access additional data from the NPRI, including datasets and mapping products: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/exploredata.html
Supplemental Information: More NPRI datasets and mapping products are available here: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/access.html
Supporting Projects: National Pollutant Release Inventory (NPRI)
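Since the files are normalized "list" format CSVs intended for pivot-table analysis, the same analysis can be done in pandas; the sketch below uses entirely hypothetical column names, so check the actual headers first.

```python
# Pivot-table sketch for the RELEASES file. All column names are hypothetical
# placeholders; check the actual CSV headers before use.
import pandas as pd

releases = pd.read_csv("NPRI_RELEASES.csv")   # placeholder file name

pivot = pd.pivot_table(
    releases,
    values="Quantity",           # hypothetical column
    index="Substance Name",      # hypothetical column
    columns="Reporting Year",    # hypothetical column
    aggfunc="sum",
)
print(pivot.head())
```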
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The open datasets include two categories:
a bibliographic dataset, which includes a subset of data from the ORCID registry (orcid.zip + orcid_data.zip) and a subset from PubMed (pubmed.zip + pubmed_data.zip);
clinical trials databases, which consist of three subsets of data from the EU Clinical Trials Register (euctr.zip + euctr_data.zip), CT.gov (ctgov.zip + ctgov_data.zip), and the German Clinical Trials Register DRKS (drks.zip + drks_data.zip).
N.B.: For each dataset, two files are available: one containing the SQL code required to create the database schema in PostgreSQL, and another, named "[dataset_name]_data.zip", containing SQL statements that insert a collection of data.
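One way to load a dataset after unzipping both archives is to run the schema file and then the data file against a PostgreSQL database, for example via psql; the database name and file paths below are placeholders.

```python
# Load one dataset into PostgreSQL after unzipping: schema first, then data.
# Database name and file paths are placeholders.
import subprocess

def run_sql(database, sql_file):
    subprocess.run(["psql", "-d", database, "-f", sql_file], check=True)

run_sql("trials", "euctr.sql")        # schema (placeholder path)
run_sql("trials", "euctr_data.sql")   # data inserts (placeholder path)
```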
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Government Data (OGD) has the potential to support social and economic progress. However, this potential can be frustrated if the data remains unused. Although the literature suggests that the quality of OGD datasets' metadata is one of the main factors affecting their use, to the best of our knowledge no quantitative study has provided evidence of this relationship. Considering about 400,000 datasets from 28 national, municipal, and international OGD portals, we have programmatically analyzed their usage, their metadata quality, and the relationship between the two. Our analysis has highlighted three main findings. First, regardless of their size, the software platform adopted, and their administrative and territorial coverage, most OGD datasets are underutilized. Second, OGD portals pay varying attention to the quality of their datasets' metadata. Third, we did not find clear evidence that dataset usage is positively correlated with better metadata publishing practices. Finally, we considered other factors, such as dataset category and some demographic characteristics of the OGD portals, and analyzed their relationship with dataset usage, obtaining partially affirmative answers.
The dataset consists of three zipped CSV files containing the collected datasets' usage data, full metadata, and computed quality values for about 400,000 datasets belonging to the 8 national, 4 international, and 16 US municipal OGD portals considered in the study.
Data collection occurred in the period: 2019-12-19 -- 2019-12-23.
Portal #Datasets Platform
US 261,514 CKAN
France 39,412 Other
Colombia 9,795 Socrata
IE 9,598 CKAN
Slovenia 4,892 CKAN
Poland 1,032 Other
Latvia 336 CKAN
Puerto Rico 178 Socrata
New York, NY 2,771 Socrata
Baltimore, MD 2,617 Socrata
Austin, TX 2,353 Socrata
Chicago, IL 1,368 Socrata
San Francisco, CA 1,001 Socrata
Dallas, TX 1,001 Socrata
Los Angeles, CA 943 Socrata
Seattle, WA 718 Socrata
Providence, RI 288 Socrata
Honolulu, HI 244 Socrata
New Orleans, LA 215 Socrata
Buffalo, NY 213 Socrata
Nashville, TN 172 Socrata
Boston, MA 170 CKAN
Albuquerque, NM 60 CKAN
Albany, NY 50 Socrata
HDX 17,325 CKAN
EUODP 14,058 CKAN
NASA 9,664 Socrata
World Bank Finances 2,177 Socrata
The three datasets share the same table structure:
Table Fields
portalid: portal identifier
id: dataset identifier
engine: identifier of the supporting portal platform: 1 (CKAN), 2 (Socrata)
admindomain: 1 (National), 2 (US), 3 (International)
downloaddate: date of data collection
views: number of total views for the dataset
downloads: number of total downloads for the dataset
overallq: overall quality values computed by applying the methodology presented by Neumaier et al. in [1]
qvalues: json object containing the quality values computed for the 17 metrics presented by Neumaier et al. [1]
assessdate: date of quality assessment
metadata: the overall dataset's metadata downloaded via API from the portal according to the supporting platform schema
[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1-2:29. doi:10.1145/2964909
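Using the fields listed above, the study's central question (is metadata quality related to usage?) can be checked per portal with a few lines of pandas; the CSV file name below is a placeholder for one of the three zipped files.

```python
# Per-portal correlation between overall metadata quality and usage, using the
# documented fields (portalid, views, overallq). The CSV file name is a placeholder.
import pandas as pd

df = pd.read_csv("ogd_datasets.csv")

for portal, group in df.groupby("portalid"):
    corr = group["overallq"].corr(group["views"], method="spearman")
    print(f"{portal}: Spearman(overallq, views) = {corr:.3f}")
```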
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a study to assess the application of process mining techniques on data from Brazilian public services, made available on open data portals, aiming to identify bottlenecks and improvement opportunities in government processes. The datasets were obtained from the Brazilian Federal Government's Open Data Portal: dados.gov.
Categorization: (1) event log; (2) there is a complete date; (3) list of data or information table; (4) documents; (5) no file found; (6) link to another portal.
Link to the Brazilian portal: https://dados.gov.br/home
List of content made available:
- open-data-sample.zip: all the files obtained from the representative sample of the study
- open-data-sample.xls: table categorizing the datasets obtained and classifying them as relevant for testing in the process mining tools
- dataset137.csv: dataset with undergraduate degree records tested in the Disco, Celonis and ProM tools
- dataset258.csv: dataset with software registration requests tested in the Disco, Celonis and ProM tools
- dataset356.csv: dataset with public tender inspector registrations tested in the Disco, Celonis and ProM tools
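As an illustration of the first step tools such as Disco, Celonis, and ProM perform on an event log (counting directly-follows relations between activities), a plain pandas sketch is shown below; the column names are hypothetical placeholders for the case id, activity, and timestamp columns of the CSVs listed above.

```python
# Plain-pandas sketch of counting "directly-follows" relations in an event log,
# the first step of most process-mining tools. Column names are hypothetical.
import pandas as pd

log = pd.read_csv("dataset137.csv")   # one of the event logs listed above
case_col, act_col, ts_col = "case_id", "activity", "timestamp"  # assumed columns

log[ts_col] = pd.to_datetime(log[ts_col])
log = log.sort_values([case_col, ts_col])
log["next_activity"] = log.groupby(case_col)[act_col].shift(-1)

dfg = (log.dropna(subset=["next_activity"])
          .groupby([act_col, "next_activity"])
          .size()
          .sort_values(ascending=False))
print(dfg.head(10))
```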
The Minimum Data Set (MDS) Frequency data summarizes health status indicators for active residents currently in nursing homes. The MDS is part of the Federally-mandated process for clinical assessment of all residents in Medicare and Medicaid certified nursing homes. This process provides a comprehensive assessment of each resident's functional capabilities and helps nursing home staff identify health problems. Care Area Assessments (CAAs) are part of this process, and provide the foundation upon which a resident's individual care plan is formulated. MDS assessments are completed for all residents in certified nursing homes, regardless of source of payment for the individual resident. MDS assessments are required for residents on admission to the nursing facility, periodically, and on discharge. All assessments are completed within specific guidelines and time frames. In most cases, participants in the assessment process are licensed health care professionals employed by the nursing home. MDS information is transmitted electronically by nursing homes to the national MDS database at CMS. When reviewing the MDS 3.0 Frequency files, note that some common software programs, e.g. Microsoft Excel, might inaccurately strip leading zeros from designated code values (i.e., "01" becomes "1") or misinterpret code ranges as dates (i.e., O0600 ranges such as 02-04 are misread as 04-Feb). As each piece of software is unique, if you encounter an issue when reading the CSV file of Frequency data, please open the file in a plain text editor such as Notepad or TextPad to review the underlying data before reaching out to CMS for assistance.
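To sidestep the leading-zero and date-misinterpretation problems described above, the CSV can be read with every column treated as text, as in this small pandas sketch (the file name is a placeholder).

```python
# Read the MDS 3.0 Frequency CSV with every column as text so values like "01"
# and ranges like "02-04" are preserved exactly. The file name is a placeholder.
import pandas as pd

freq = pd.read_csv("mds_frequency.csv", dtype=str)
print(freq.head())
```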
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Objective(s): Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw data sets, in part due to concerns regarding research participant confidentiality and privacy. We provide an instructional video describing a standardized de-identification framework that can be adapted and refined based on specific context and risks. Data Description: Training video, presentation slides. Related Resources: The data de-identification algorithm, dataset, and data dictionary that correspond with this training video are available through the Smart Triage sub-Dataverse. NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."