Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Definitions and examples of the moves of the UPOCS genre
Use this summary report to properly interpret 2021 NSDUH estimates of substance use and mental health issues. The report accompanies the annual detailed tables and covers overall methodology, key definitions for measures and terms used in 2021 NSDUH reports and tables, and selected analyses of the measures and how they should be interpreted.
The report is organized into six chapters:
- Introduction.
- Description of the survey, including information about the sample design, data collection procedures, and key aspects of data processing such as development of the analysis weights. The report also includes methodological changes and related issues in the 2021 NSDUH due to COVID-19.
- Technical details on the statistical methods and measurement, such as suppression criteria for unreliable estimates, statistical testing procedures, issues around selected substance use and mental health measures, and the impact of methodological changes on response rates.
- Special topics related to prescription psychotherapeutic drugs.
- A comparison between NSDUH and other sources of data on substance use and mental health issues, including data sources for populations outside the NSDUH target population.
- A more in-depth view of special methodological issues for the 2021 NSDUH, including the results of special analyses that led SAMHSA to not compare estimates from 2021 to estimates from previous years.
An appendix covers key definitions used in NSDUH reports and tables.
Definitions of independent variables used in the statistical analysis.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This data was collected and analyzed as part of a study on PII disclosures in social media conversations, with special attention to influencer characteristics in the interactions, for the dissertation titled Privacy vs. Social Capital: Examining Information Disclosure Patterns within Social Media Influencer Networks and the research paper titled Unveiling Influencer-Driven Personal Data Sharing in Social Media Discourse.
Each study phase used a different platform: X (Twitter) data in the pilot analysis and Reddit data in the main study. Each platform folder contains the analyzed_posts and cluster summary CSV files, broken down by collection (either by trend or by collection date).
Note: Raw data is not made available in these datasets due to the nature of the study and to protect the original authors.
| Column name | Type | Description |
|---|---|---|
| Node ID | UUID | Unique identifier for post (replaces original platform identifier) |
| User ID | UUID | Unique identifier assigned for user (replaces original platform identifier) |
| Cluster Name | Str | Composite ID for subgraph using collection name and subgraph index |
| Influence Power | Float | Eigenvector centrality |
| Influencer Tier | Str | Categorical label calculated by follower count |
| Collection Name | Str | Trend collection assigned based on search query |
| Hashtags | Set(str) | The set of hashtags included in the node |
| PII Disclosed | Bool | Whether or not PII was disclosed |
| PII Detected | Set(str) | The detected token types in post |
| PII Risk Score | Float | The PII score for all tokens in a post |
| Is Comment | Bool | Whether or not the post is a comment or reply |
| Is Text Starter | Bool | Whether or not the post has text content |
| Community | Str | The group, community, channel, etc. associated with the post |
| Timestamp | Timestamp | Creation timestamp (provided by social media API) |
| Time Elapsed | Int | Time elapsed (seconds) from original influencer’s post |
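The Influence Power column stores eigenvector centrality. As an illustration only (not the study's code), it can be approximated with a shifted power iteration on a toy graph:

```python
def eigenvector_centrality(adj, iters=200):
    """Shifted power iteration (A + I) for eigenvector centrality on a
    small undirected graph given as {node: [neighbors]}; scores are
    normalized so the most central node gets 1.0."""
    score = {n: 1.0 for n in adj}
    for _ in range(iters):
        # Adding the node's own score (the +I shift) avoids oscillation
        # on bipartite graphs such as a star.
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        top = max(new.values())
        score = {n: v / top for n, v in new.items()}
    return score

# Tiny star graph: the hub should dominate.
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = eigenvector_centrality(adj)
```

In the actual dataset these scores would be computed per cluster from the post graph; the toy graph here is purely illustrative.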
| Column Name | Type | Description |
|---|---|---|
| Cluster Name | Str | Composite ID for subgraph using collection name and subgraph index |
| Influencer Tiers Frequencies | List[dict] | Frequency of influencer tiers of all users in the cluster |
| Top Influence Power Score | Float | Eigenvector centrality of top influencer |
| Top Influencer Tier | Str | Size tier of top influencer |
| Collection Name | Str | Trend collection assigned based on search query. |
| Hashtags | Set(str) | The set of hashtags included in the cluster |
| PII Detection Frequencies | List[dict] | The detected token types in post with frequencies |
| Node Count | Int | Count of all nodes in the influencer cluster |
| Node Disclosures | Int | Count of all nodes with mean_risk_score > 1* |
| Disclosure Ratio | Float | Sum of nodes with confirmed disclosed PII divided by overall cluster size (count of nodes in the cluster) |
| Mean Risk Score | Float | The mean risk score for an entire network cluster |
| Median Risk Score | Float | The median risk score for an entire network cluster |
| Min Risk Score | Float | The min risk score for an entire network cluster |
| Max Risk Score | Float | The max risk score for an entire network cluster |
| Time Span | Float | Total Time Elapsed |
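As a sketch of how the cluster summary relates to the node-level table, the Disclosure Ratio and risk statistics can be recomputed from analyzed_posts-style rows with pandas. The column names and values below are illustrative assumptions and may not match the CSVs exactly:

```python
import pandas as pd

# Hypothetical node-level rows mirroring the analyzed_posts schema.
nodes = pd.DataFrame({
    "Cluster Name": ["trendA_0", "trendA_0", "trendA_0", "trendB_1"],
    "PII Disclosed": [True, False, True, False],
    "PII Risk Score": [2.5, 0.0, 1.5, 0.2],
})

# Aggregate per cluster, as in the cluster summary files.
summary = nodes.groupby("Cluster Name").agg(
    node_count=("PII Risk Score", "size"),
    node_disclosures=("PII Disclosed", "sum"),
    mean_risk=("PII Risk Score", "mean"),
)
# Disclosure Ratio: disclosing nodes divided by overall cluster size.
summary["disclosure_ratio"] = summary["node_disclosures"] / summary["node_count"]
```

The same groupby pattern extends to the median, min, and max risk scores listed above.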
The objective of this study was to identify the patterns of juvenile salmonid distribution and relative abundance in relation to habitat correlates. It is the first dataset of its kind because the entire river was snorkeled by one person in multiple years. During two consecutive summers, we completed a census of juvenile salmonids and stream habitat across a stream network. We used the data to test the ability of habitat models to explain the distribution of juvenile coho salmon (Oncorhynchus kisutch), young-of-the-year (age 0) steelhead (Oncorhynchus mykiss), and steelhead parr (= age 1) for a network consisting of several different-sized streams. Our network-scale models, which included five stream habitat variables, explained 27%, 11%, and 19% of the variation in the density of juvenile coho salmon, age 0 steelhead, and steelhead parr, respectively. We found weak to strong levels of spatial auto-correlation in the model residuals (Moran's I values ranging from 0.25 to 0.71). Explanatory power of base habitat models increased substantially, and the level of spatial auto-correlation decreased, with sequential inclusion of variables accounting for stream size, year, stream, and reach location. The models for specific streams underscored the variability that was implied in the network-scale models. Associations between juvenile salmonids and individual habitat variables were rarely linear and ranged from negative to positive, and the variable accounting for location of the habitat within a stream was often more important than any individual habitat variable. The limited success in predicting the summer distribution and density of juvenile coho salmon and steelhead with our network-scale models was apparently related to variation in the strength and shape of fish-habitat associations across and within streams and years. Summary of statistical analysis of the Calawah Riverscape data. NOAA was not involved and did not pay for the collection of this data.
This data represents the statistical analysis carried out by Martin Liermann as a NOAA employee.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Neural decoding is a powerful method to analyze neural activity. However, the code needed to run a decoding analysis can be complex, which can present a barrier to using the method. In this paper we introduce a package that makes it easy to perform decoding analyses in the R programming language. We describe how the package is designed in a modular fashion, which allows researchers to easily implement a range of different analyses. We also discuss how to format data to be able to use the package, and we give two examples of how to use the package to analyze real data. We believe that this package, combined with the rich data analysis ecosystem in R, will make it significantly easier for researchers to create reproducible decoding analyses, which should help increase the pace of neuroscience discoveries.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This article aimed to discuss the different definitions of slums and favelas and their implications for population data. The definitions discussed were extracted from research related to the United Nations Human Settlements Programme (UN-Habitat) and the Instituto Brasileiro de Geografia e Estatística (IBGE). The data manipulation was performed according to the content analysis (CA) approach. The quantification performed with Iramuteq software was based on word frequency and factorial correspondence analysis (FCA). Qualitative and quantitative analyses highlighted two major differences: in the object characterization (area, building, and both); and in the qualification type (legal aspects, construction standards, infrastructure deficiency, land property, population density, geographic references, and residents typing). With the high number of qualifications and diverse content, the population data aggregate different information, making their comparison less accurate. This imprecision tends to expand with the growth of the area and the number of countries analyzed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on time series analysis.
Introduction
Time series are a special class of dataset in which a response variable is tracked over time. The frequency of measurement and the timespan of the dataset can vary widely. At its simplest, a time series model includes an explanatory time component and a response variable. Mixed models can include additional explanatory variables (check out the nlme and lme4 R packages). We will cover a few simple applications of time series analysis in these lessons.
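As a minimal illustration of the simplest case described above (the lesson itself works in R; this sketch uses Python), the time component can be regressed against the response with an ordinary least-squares linear fit:

```python
import numpy as np

# A made-up series with a known linear trend: the simplest time series
# model is just response ~ time.
t = np.arange(10, dtype=float)   # explanatory time component
y = 2.0 * t + 5.0                # response variable

# Least-squares fit of a degree-1 polynomial recovers slope and intercept.
slope, intercept = np.polyfit(t, y, deg=1)
```

Real series add noise, gaps, and seasonality, which is where the dedicated time series methods covered later come in.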
Opportunities
Analysis of time series presents several opportunities. In aquatic sciences, some of the most common questions we can answer with time series modeling are:
Can we forecast conditions in the future?
Challenges
Time series datasets come with several caveats, which need to be addressed in order to effectively model the system. A few common challenges that arise (and can occur together within a single dataset) are:
Autocorrelation: Data points are not independent from one another (i.e., the measurement at a given time point is dependent on previous time point(s)).
Data gaps: Data are not collected at regular intervals, necessitating interpolation between measurements. There are often gaps between monitoring periods. For many time series analyses, we need equally spaced points.
Seasonality: Cyclic patterns in variables occur at regular intervals, impeding clear interpretation of a monotonic (unidirectional) trend. For example, we can assume that summer temperatures will be higher than winter temperatures.
Heteroscedasticity: The variance of the time series is not constant over time.
Covariance: The covariance of the time series is not constant over time. Many time series models assume that both the variance and the covariance remain constant over time (see heteroscedasticity above).
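A quick way to check the autocorrelation caveat above is to compute the lag-1 autocorrelation of a series, i.e., its correlation with itself shifted by one step (a Python sketch; the lesson works in R, where `acf()` does this directly):

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation: correlation of the series with itself
    shifted by one time step."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# A trending series is strongly autocorrelated; white noise is not.
trend = np.arange(50, dtype=float)
rng = np.random.default_rng(0)
noise = rng.standard_normal(500)
```

Values near 1 signal that successive measurements are far from independent, so ordinary regression standard errors would be misleading.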
Learning Objectives
After successfully completing this notebook, you will be able to:
Choose appropriate time series analyses for trend detection and forecasting
Discuss the influence of seasonality on time series analysis
Interpret and communicate results of time series analyses
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT: Companies are encouraged by the big data trend to experiment with advanced analytics, and many turn to specialist consultancies to help them get started where they lack the necessary competences. We investigate the program of one such consultancy, Advectas, in particular its advanced analytics Jumpstart. Using qualitative techniques, including semi-structured interviews and content analysis, we investigate the nature and value of the Jumpstart concept through five cases in different companies. We provide a definition, a process model, and a set of thirteen best practices derived from these experiences, and discuss the distinctive qualities of this approach.
Use this summary report to properly interpret 2022 NSDUH estimates related to substance use, mental health, and treatment. The report accompanies the annual detailed tables and covers overall methodology, key definitions for measures and terms used in 2022 NSDUH reports and tables, and selected analyses of the measures and how they should be interpreted.
The report is organized into five chapters:
- Introduction.
- Description of the survey, including information about the sample design, data collection procedures and questionnaire changes, and key aspects of data processing such as development of the analysis weights.
- Technical details on the statistical methods and measurement, such as suppression criteria for unreliable estimates, statistical testing procedures, revised estimates for 2021 to account for data collection mode, and issues around selected substance use and mental health measures.
- Special topics related to prescription psychotherapeutic drugs.
- Description of other sources of data on substance use and mental health issues in the United States, including data sources for populations outside the NSDUH target population.
An appendix covers key definitions used in NSDUH reports and tables.
Use this summary report to properly interpret 2019 NSDUH estimates of substance use and mental health issues. The report accompanies the annual detailed tables and covers overall methodology, key definitions for measures and terms used in 2019 NSDUH reports and tables, and selected analyses of the measures and how they should be interpreted.
The report is organized into five chapters:
- Introduction.
- Description of the survey, including information about the sample design, data collection procedures, and key aspects of data processing such as development of the analysis weights.
- Technical details on the statistical methods and measurement, such as suppression criteria for unreliable estimates, statistical testing procedures, issues around data accuracy, and measurement issues for selected substance use and mental health measures.
- Special topics related to prescription psychotherapeutic drugs.
- A comparison between NSDUH and other sources of data on substance use and mental health issues, including data sources for populations outside the NSDUH target population.
An appendix covers key definitions used in NSDUH reports and tables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems" conducted by Martin Lnenicka (University of Hradec Králové, Czech Republic), Anastasija Nikiforova (University of Tartu, Estonia), Mariusz Luterek (University of Warsaw, Warsaw, Poland), Petar Milic (University of Pristina - Kosovska Mitrovica, Serbia), Daniel Rudmark (Swedish National Road and Transport Research Institute, Sweden), Sebastian Neumaier (St. Pölten University of Applied Sciences, Austria), Karlo Kević (University of Zagreb, Croatia), Anneke Zuiderwijk (Delft University of Technology, Delft, the Netherlands), Manuel Pedro Rodríguez Bolívar (University of Granada, Granada, Spain).
As there is a lack of understanding of the elements that constitute different types of value-adding public data ecosystems, and of how these elements form and shape the development of these ecosystems over time, which can lead to misguided efforts to develop future public data ecosystems, the aim of the study is: (1) to explore how public data ecosystems have developed over time and (2) to identify the value-adding elements and formative characteristics of public data ecosystems. Using an exploratory retrospective analysis and a deductive approach, we systematically review 148 studies published between 1994 and 2023. Based on the results, this study presents a typology of public data ecosystems, develops a conceptual model of the elements and formative characteristics that contribute most to value-adding public data ecosystems, and proposes a model of the evolution of public data ecosystems represented by six generations, called the Evolutionary Model of Public Data Ecosystems (EMPDE). Finally, three avenues for a future research agenda are proposed.
This dataset is being made public both to act as supplementary data for "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems", Telematics and Informatics, and to document the Systematic Literature Review component that informs the study.
Description of the data in this data set
PublicDataEcosystem_SLR provides the structure of the protocol.
Spreadsheet #1 provides the list of results after the search over three indexing databases and filtering out irrelevant studies.
Spreadsheet #2 provides the protocol structure.
Spreadsheet #3 provides the filled protocol for relevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.
Descriptive Information
Article number
The study number, corresponding to the number assigned in the Excel worksheet.
Complete reference
The complete source information to refer to the study (in APA style), including the author(s) of the study, the year in which it was published, the study's title and other source information.
Year of publication
The year in which the study was published.
Journal article / conference paper / book chapter
The type of the paper, i.e., journal article, conference paper, or book chapter.
Journal / conference / book
The journal, conference, or book in which the paper is published.
DOI / Website
A link to the website where the study can be found.
Number of words
The number of words in the study.
Number of citations in Scopus and WoS
The number of citations of the paper in Scopus and WoS digital libraries.
Availability in Open Access
Whether the study is available in Open Access or Free / Full Access.
Keywords
Keywords of the paper as indicated by the authors (in the paper).
Relevance for our study (high / medium / low)
The relevance level of the paper for our study.
Approach- and research design-related information
Objective / Aim / Goal / Purpose & Research Questions
The research objective and established RQs.
Research method (including unit of analysis)
The methods used to collect data in the study, including the unit of analysis that refers to the country, organisation, or other specific unit that has been analysed such as the number of use-cases or policy documents, number and scope of the SLR etc.
Study’s contributions
The study’s contribution as defined by the authors
Qualitative / quantitative / mixed method
Whether the study uses a qualitative, quantitative, or mixed-methods approach.
Availability of the underlying research data
Whether the paper has a reference to the public availability of the underlying research data (e.g., transcriptions of interviews, collected data) or explains why these data are not openly shared.
Period under investigation
Period (or moment) in which the study was conducted (e.g., January 2021-March 2022)
Use of theory / theoretical concepts / approaches? If yes, specify them
Does the study mention any theory / theoretical concepts / approaches? If yes, what theory / concepts / approaches? If any theory is mentioned, how is theory used in the study? (e.g., mentioned to explain a certain phenomenon, used as a framework for analysis, tested theory, theory mentioned in the future research section).
Quality-related information
Quality concerns
Whether there are any quality concerns (e.g., limited information about the research methods used).
Public Data Ecosystem-related information
Public data ecosystem definition
How is the public data ecosystem defined in the paper, including any equivalent term used (most often "infrastructure")? If an alternative term is used, what is the public data ecosystem called in the paper?
Public data ecosystem evolution / development
Does the paper define the evolution of the public data ecosystem? If yes, how is it defined and what factors affect it?
What constitutes a public data ecosystem?
What constitutes a public data ecosystem (components & relationships) - their "FORM / OUTPUT" presented in the paper (general description with more detailed answers to further additional questions).
Components and relationships
What components does the public data ecosystem consist of and what are the relationships between these components? Alternative names for components - element, construct, concept, item, helix, dimension etc. (detailed description).
Stakeholders
What stakeholders (e.g., governments, citizens, businesses, Non-Governmental Organisations (NGOs) etc.) does the public data ecosystem involve?
Actors and their roles
What actors does the public data ecosystem involve? What are their roles?
Data (data types, data dynamism, data categories etc.)
What data does the public data ecosystem cover (or is intended / designed for)? Refer to all data-related aspects, including but not limited to data types, data dynamism (static data, dynamic data, real-time data, stream), and prevailing data categories / domains / topics.
Processes / activities / dimensions, data lifecycle phases
What processes, activities, dimensions and data lifecycle phases (e.g., locate, acquire, download, reuse, transform, etc.) does the public data ecosystem involve or refer to?
Level (if relevant)
What is the level of the public data ecosystem covered in the paper? (e.g., city, municipal, regional, national (=country), supranational, international).
Other elements or relationships (if any)
What other elements or relationships does the public data ecosystem consist of?
Additional comments
Additional comments (e.g., what other topics affected the public data ecosystems and their elements, what is expected to affect the public data ecosystems in the future, what were important topics by which the period was characterised etc.).
New papers
Does the study refer to any other potentially relevant papers?
Additional references to potentially relevant papers that were found in the analysed paper (snowballing).
Format of the files: .xls, .csv (for the first spreadsheet only), .docx
Licenses or restrictions: CC-BY
For more info, see README.txt
The Dictionary of Algorithms and Data Structures (DADS) is an online, publicly accessible dictionary of generally useful algorithms, data structures, algorithmic techniques, archetypal problems, and related definitions. In addition to brief definitions, some entries have links to related entries, links to implementations, and additional information. DADS is meant to be a resource for the practicing programmer, although students and researchers may find it a useful starting point. DADS has fundamental entries in areas such as theory, cryptography and compression, graphs, trees, and searching, for instance, Ackermann's function, quick sort, traveling salesman, big O notation, merge sort, AVL tree, hash table, and Byzantine generals. DADS also has index pages that list entries by area and by type. Currently DADS does not include algorithms particular to business data processing, communications, operating systems or distributed algorithms, programming languages, AI, graphics, or numerical analysis.
The zebrafish Danio rerio has become a popular model host to explore disease pathology caused by infectious agents. A main advantage is its transparency at an early age, which enables live imaging of infection dynamics. While multispecies infections are common in patients, the zebrafish model is rarely used to study them, although the model would be ideal for investigating pathogen-pathogen and pathogen-host interactions. This may be due to the absence of an established multispecies infection protocol for a defined organ and the lack of suitable image analysis pipelines for automated image processing. To address these issues, we developed a protocol for establishing and tracking single and multispecies bacterial infections in the inner ear structure (otic vesicle) of the zebrafish by imaging. Subsequently, we generated an image analysis pipeline that involved deep learning for the automated segmentation of the otic vesicle, and scripts for quantifying pathogen frequencies through fluorescence intensity measures. We used Pseudomonas aeruginosa, Acinetobacter baumannii, and Klebsiella pneumoniae, three of the difficult-to-treat ESKAPE pathogens, to show that our infection protocol and image analysis pipeline work both for single pathogens and pairwise pathogen combinations. Thus, our protocols provide a comprehensive toolbox for studying single and multispecies infections in real-time in zebrafish.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a detailed collection of information related to [your topic], offering valuable insights for data analysis, visualization, and model development. It consists of multiple features such as [list of important columns], which capture various dimensions of the subject in a structured and measurable way.
The purpose of this dataset is to support exploratory data analysis (EDA) and predictive modeling by allowing users to identify trends, patterns, and relationships among variables. It can serve as a foundation for building machine learning models, performing statistical studies, or generating data-driven visual reports.
Researchers, data enthusiasts, and students can use this dataset to enhance their analytical understanding, practice preprocessing techniques, and improve their ability to draw meaningful conclusions from real-world data.
Additionally, this dataset can be explored to uncover correlations, test hypotheses, and visualize behavioral or performance patterns. Its clean structure and well-defined variables make it suitable for both beginners learning EDA and experienced professionals developing predictive insights.
License: https://www.nist.gov/open/license
Supporting data for the results of Interlaboratory 1 of the Method Assessment for Non-Targeted Analyses. The datasets include the chemical compound descriptions, laboratory mean responses, and the tools for the principal components analysis of the datasets. In addition, a Microsoft Excel file, which was given to all participants, allowed for the analysis of the metadata.
Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.
Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. The key objectives are:
1. Data Collection and Cleaning: gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies); clean and preprocess the data to ensure accuracy and consistency.
2. Descriptive Statistics: summarize key statistics (total cases, recoveries, deaths, and testing rates); visualize temporal trends using line charts, bar plots, and heat maps.
3. Geospatial Analysis: map COVID-19 cases across countries, regions, or cities; identify hotspots and variations in infection rates.
4. Demographic Insights: explore how age, gender, and pre-existing conditions impact vulnerability; investigate disparities in infection rates among different populations.
5. Healthcare System Impact: analyze hospitalization rates, ICU occupancy, and healthcare resource allocation; assess the strain on medical facilities.
6. Economic and Social Effects: investigate the relationship between lockdown measures, economic indicators, and infection rates; explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
7. Predictive Modeling (optional): if data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
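The first two objectives (collection/cleaning and descriptive statistics) can be sketched in a few lines of pandas. The frame below is made up for illustration; real column names depend on the chosen source (e.g., the JHU CSSE repository):

```python
import pandas as pd

# Toy stand-in for a cleaned case-count table (hypothetical columns).
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "cases": [100, 150, 20, 35],
    "deaths": [1, 2, 0, 1],
})

# Descriptive statistics: totals per country.
totals = df.groupby("country")[["cases", "deaths"]].sum()

# Temporal trends: one cases column per country, indexed by date,
# ready for a line chart via daily.plot().
daily = df.pivot(index="date", columns="country", values="cases")
```

The geospatial and demographic objectives would follow the same pattern, joining additional columns before grouping.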
Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.
Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.
License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.
Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either the scripts used to construct and analyze the dataset or the data files comprising the dataset itself.
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
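The iterparse scripts themselves are included in the repository; as a rough illustration of the streaming XML-to-CSV pattern they rely on, the sketch below runs Python's `xml.etree.ElementTree.iterparse` over a made-up miniature record layout. The tag and field names here are hypothetical, not the real schema, which is documented in CIPO's data dictionary.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of a CIPO-style XML export; the real tag and
# field names differ and are documented in CIPO's data dictionary.
SAMPLE_XML = b"""<applications>
  <application><number>1234567</number><status>Registered</status></application>
  <application><number>7654321</number><status>Pending</status></application>
</applications>"""

def xml_to_csv(xml_file, csv_file):
    """Stream-parse <application> records and write one CSV row each.

    iterparse avoids loading the whole document into memory, which matters
    for multi-gigabyte archives like the CIPO downloads.
    """
    writer = csv.writer(csv_file)
    writer.writerow(["number", "status"])
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "application":
            writer.writerow([elem.findtext("number"), elem.findtext("status")])
            elem.clear()  # free memory held by already-processed records

out = io.StringIO()
xml_to_csv(io.BytesIO(SAMPLE_XML), out)
print(out.getvalue())
```

Each real iterparse script presumably filters on its own record type and writes many more columns, but the stream-then-clear structure is the part that keeps memory use flat.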
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies 908 (2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
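As an illustration of the space-saving idea behind that labeling step (a sketch, not a translation of the do-file), the following Python snippet round-trips a toy CSV through Stata's .dta format with pandas, storing a repetitive string column as a labeled categorical. The column names and values are invented.

```python
import io
import tempfile
import pandas as pd

# Toy stand-in for one of the dataset's CSV files; real column names differ.
csv_text = "app_no,status\n1234567,Registered\n7654321,Pending\n"
df = pd.read_csv(io.StringIO(csv_text))

# Storing repetitive strings as labeled categories mirrors what the
# do-file's labeling step achieves: smaller files, same information.
df["status"] = df["status"].astype("category")

with tempfile.NamedTemporaryFile(suffix=".dta", delete=False) as f:
    path = f.name
df.to_stata(path, write_index=False)

round_trip = pd.read_stata(path)
print(round_trip)
```

In the full dataset the savings come from labeling high-cardinality repeated fields (statuses, party types, event codes) across millions of rows.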
The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data refer to the paper “Upcoming issues, new methods: using Interactive Qualitative Analysis (IQA) in Management Research”. The article is a guide to applying the IQA method in management research, and the available files are:
1. 1-Affinities, definitions, and cards produced by focus group.docx: all cards, affinities, and definitions created in the focus group session
2. 2-Step-by-step - Analysis procedures.docx: detailed data analysis procedures
3. 3-Axial Coding Tables – Individual Interviews.docx: detailed axial coding procedures
4. 4-Theoretical Coding Table – Individual Interviews.docx: detailed theoretical coding procedures
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview
The OYO Hotel Rooms Dataset provides comprehensive data on hotel room listings from OYO, covering various attributes related to pricing, amenities, and customer ratings. This dataset is valuable for researchers, data scientists, and machine learning practitioners interested in hospitality analytics, price prediction, customer satisfaction analysis, and clustering-based insights.
Data Source
The dataset has been collected from publicly available OYO hotel listings and includes structured information for analysis.
Features
The dataset consists of multiple attributes that define each hotel room, including:
Hotel Name: The name of the hotel property.
City: The location where the hotel is situated.
Room Type: Category of the room (e.g., Standard, Deluxe, Suite).
Price (INR): The cost per night in Indian Rupees.
Discounted Price: The price after applying discounts.
Rating: The customer rating for the hotel (out of 5).
Reviews: The number of customer reviews.
Amenities: A list of available facilities such as WiFi, AC, Breakfast, Parking, etc.
Latitude & Longitude: Geolocation details for mapping and spatial analysis.
Potential Use Cases
Price Prediction: Using regression models to predict hotel room pricing.
Customer Sentiment Analysis: Analyzing ratings and reviews to understand customer satisfaction.
Market Segmentation: Clustering hotels based on price, rating, and location.
Recommendation Systems: Building personalized hotel recommendations.
File Format
OYO_HOTEL_ROOMS.xlsx (Excel format) – Contains structured tabular data.
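As a minimal starting point for the pricing use cases above, the sketch below builds a few synthetic rows matching the dataset's schema (values invented, not taken from the actual listings) and computes the average effective discount per city with pandas.

```python
import pandas as pd

# Synthetic rows mimicking the dataset's schema; values are illustrative,
# not drawn from the real OYO listings.
rooms = pd.DataFrame({
    "Hotel Name": ["OYO Alpha", "OYO Beta", "OYO Gamma"],
    "City": ["Delhi", "Mumbai", "Delhi"],
    "Room Type": ["Standard", "Deluxe", "Suite"],
    "Price (INR)": [1500, 2400, 4000],
    "Discounted Price": [1200, 1800, 3600],
    "Rating": [4.2, 3.8, 4.6],
})

# Effective discount per listing, then the mean discount per city --
# a first step toward the price-prediction and segmentation use cases.
rooms["Discount %"] = 100 * (1 - rooms["Discounted Price"] / rooms["Price (INR)"])
by_city = rooms.groupby("City")["Discount %"].mean().round(1)
print(by_city)
```

For the real file, replacing the synthetic frame with `pd.read_excel("OYO_HOTEL_ROOMS.xlsx")` would run the same analysis, provided the column names match.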
Acknowledgment
This dataset is intended for academic and research purposes. The data is sourced from publicly available hotel listings and does not contain any personally identifiable information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Definitions and examples of the moves of the UPOCS genre