61 datasets found
  1. The Canada Trademarks Dataset

    • zenodo.org
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jeremy Sheff
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Canada
    Description

    The Canada Trademarks Dataset

    18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

    Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

    Python and Stata Scripts (c) 2021 Jeremy Sheff

    Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

    This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

    Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

    Terms of Use:

    As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

    The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

    The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

    Details of Repository Contents:

    This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

    • /csv: contains the .csv versions of the data files
    • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
    • /dta: contains the .dta versions of the data files
    • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

    If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

    The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
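    For orientation, here is a minimal sketch of the streaming-parse pattern the iterparse scripts use, assuming a hypothetical element name and field names; the real scripts map CIPO's XML schema to the published CSV columns:

      import csv
      import xml.etree.ElementTree as ET

      # Stream a large CIPO XML file and write selected fields to CSV.
      # "Application", "ApplicationNumber", and "FilingDate" are hypothetical
      # names for illustration; consult the actual iterparse scripts for the
      # real schema mapping.
      with open("applications.csv", "w", newline="", encoding="utf-8") as out:
          writer = csv.writer(out)
          writer.writerow(["application_number", "filing_date"])
          for _, elem in ET.iterparse("XML_raw/sample.xml"):
              if elem.tag == "Application":
                  writer.writerow([
                      elem.findtext("ApplicationNumber"),
                      elem.findtext("FilingDate"),
                  ])
                  elem.clear()  # release parsed elements to keep memory flat

    The advantage of iterparse over a full tree parse is constant memory use, which matters given the size of the downloaded archives.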

    With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

    The Python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.

    This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

  2. Leadbook B2B Contact Data Global Coverage, 200 Million Real-Time Generated...

    • datarade.ai
    .csv
    Updated May 18, 2021
    Cite
    Leadbook (2021). Leadbook B2B Contact Data Global Coverage, 200 Million Real-Time Generated Contacts with Custom Filter [Dataset]. https://datarade.ai/data-products/leadbook-b2b-contact-data-global-coverage-200-million-busine-leadbook
    Explore at:
    Available download formats: .csv
    Dataset updated
    May 18, 2021
    Dataset authored and provided by
    Leadbook
    Area covered
    Virgin Islands (British), Bangladesh, Kiribati, Martinique, Mozambique, Finland, Curaçao, Slovenia, United Arab Emirates, Belize
    Description

    Build and customise datasets to match your target audience profile, from a database of 200 million global contacts generated in real-time. Get business contact information that's verified by Leadbook's proprietary A.I. powered data technology.

    Our industry data enables you to reach prospects and maximize your sales and revenue by offering impeccable data. Our data covers several industries, providing result-oriented records to help you build and grow your business. Our industry-wise data is a vast repository of verified and opt-in contacts.

    Executives and professionals contact data lets you connect with prospects to effectively market B2B products and services. All of our email addresses come with a 97% or better deliverability guarantee.

    Simply specify location, industry, employee headcount, job function and/or seniority attributes; the platform will then verify the contacts' business information in real time, and you can download the records in a CSV file.

    All records include:

    • Contact name
    • Job title
    • Contact email address
    • Contact location
    • Contact LinkedIn URL
    • Organisation name
    • Organisation website
    • Organisation type
    • Organisation headcount
    • Primary industry

    Additional information like organization phone numbers, organization address, business registration number and secondary industries may be provided where available.

    Prices start from USD 0.40 per contact for rental and USD 0.80 per contact for purchase. Bulk discounts apply.

  3. riiid_train_converted to Multiple Formats

    • kaggle.com
    Updated Jun 2, 2021
    Cite
    Santh Raul (2021). riiid_train_converted to Multiple Formats [Dataset]. https://www.kaggle.com/santhraul/riiid-train-converted-to-multiple-formats/discussion
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Santh Raul
    Description

    Context

    The train data for the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM using the default pandas read_csv, prompting a search for alternative approaches and formats.

    Content

    Train data of Riiid competition in different formats.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Reading the .csv file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .csv into different file formats that can be loaded more easily into a Kaggle kernel.
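    As a hedged illustration of that conversion (the dtype choices below are illustrative and should be checked against the competition's data description), the train file can be read once with narrow integer types and written to a binary format that loads much faster:

      import pandas as pd

      # Pinning narrow integer dtypes cuts memory use substantially compared
      # with the pandas defaults; column names follow the Riiid competition.
      dtypes = {
          "user_id": "int32",
          "content_id": "int16",
          "answered_correctly": "int8",
      }
      df = pd.read_csv("train.csv", usecols=list(dtypes), dtype=dtypes)
      df.to_feather("train.feather")    # requires pyarrow
      # df.to_parquet("train.parquet")  # Parquet works equally well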

  4. Data from: T1DiabetesGranada: a longitudinal multi-modal dataset of type 1...

    • data.niaid.nih.gov
    • produccioncientifica.ugr.es
    Updated Feb 2, 2024
    Cite
    Munoz-Torres, Manuel (2024). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10050943
    Explore at:
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Quesada-Charneco, Miguel
    Aviles Perez, Maria Dolores
    Munoz-Torres, Manuel
    Villalonga, Claudia
    Rodriguez-Leon, Ciro
    Lopez-Ibarra, Pablo J
    Banos, Oresti
    Description

    T1DiabetesGranada

    A longitudinal multi-modal dataset of type 1 diabetes mellitus

    Documented by:

    Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4

    Background

    Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, a longitudinal dataset, open under specific permission, that provides not only continuous glucose levels but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.

    Data Records

    The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.

    Patient_info.csv

    Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Sex – Sex of the patient. Values: F (for female), M (for male).

    Birth_year – Year of birth of the patient. Format: YYYY.

    Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.

    Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.

    Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.

    Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.

    Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.

    Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.

    Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.

    Number_of_diagnostics – Number of diagnoses recorded for the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.

    Glucose_measurements.csv

    Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.

    Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.

    Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
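    As a hedged usage sketch (not part of the dataset itself), the date and time columns above can be combined into a single timestamp when loading the file with pandas:

      import pandas as pd

      # Glucose_measurements.csv has >22.6 million rows; read in chunks if
      # memory is tight.
      glucose = pd.read_csv("Glucose_measurements.csv")
      glucose["timestamp"] = pd.to_datetime(
          glucose["Measurement_date"] + " " + glucose["Measurement_time"]
      )
      # Example: mean glucose per patient per day.
      daily = glucose.groupby(
          ["Patient_ID", glucose["timestamp"].dt.date]
      )["Measurement"].mean()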

    Biochemical_parameters.csv

    Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.

    Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.

    Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.

    Diagnostics.csv

    Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:

    Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.

    Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).

    Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).

    Technical Validation

    Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc. (Alameda, CA, USA), the manufacturer, has conducted validation studies of these devices, concluding that the measurements made by their sensors compare well to those of YSI analyzer devices (Xylem Inc.), the gold standard, with results falling within zones A and B of the consensus error grid 99.9% of the time. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.

    Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient in the Glucose_measurements.csv file were continuous (i.e., a sample at least every 15 minutes), as they should be.

    Usage Notes

    For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.

    The files that compose the dataset are comma-delimited CSV files, available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help users better understand the dataset, with graphics and statistics, is available in UsageNotes.zip.

    Graphs_and_stats.ipynb

    The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions, such as calculating a patient's age, removing a list of patients from a dataset file, or keeping only a given list of patients in a dataset file.
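    A helper of the kind described might look like the following sketch (the function name and patient ID are made up for illustration; the notebook's own implementation may differ):

      import pandas as pd

      def keep_patients(path, patient_ids, out_path):
          # Keep only the rows belonging to the given list of patients.
          df = pd.read_csv(path)
          df[df["Patient_ID"].isin(patient_ids)].to_csv(out_path, index=False)

      # "LIB190001" is a made-up ID matching the LIB19XXXX format.
      keep_patients("Diagnostics.csv", ["LIB190001"], "Diagnostics_subset.csv")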

    Code Availability

    The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8 and was used for tasks such as data curation, transformation, and variable extraction.

    Original_patient_info_curation.ipynb

    This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed, and the sex variable is recoded.

    Glucose_measurements_curation.ipynb

    This Jupyter Notebook preprocesses the original file with the patients' continuous glucose level measurements. Principally, rows without information or duplicated rows are removed, and the timestamp variable is split into two new variables, measurement date and measurement time.

    Biochemical_parameters_curation.ipynb

    This Jupyter Notebook preprocesses the original file with data from the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed, and the variable with the name of the measured biochemical parameter is translated.

    Diagnostic_curation.ipynb

    This Jupyter Notebook preprocesses the original file with the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.

    Get_patient_info_variables.ipynb

    This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.

    Data Usage Agreement

    The conditions for use are as follows:

    You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.

    You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.

    You will require

  5. CompanyData.com (BoldData) — Vietnam Largest B2B Company Database — 1.83+...

    • datarade.ai
    Updated Apr 21, 2021
    Cite
    CompanyData.com (BoldData) (2021). CompanyData.com (BoldData) — Vietnam Largest B2B Company Database — 1.83+ Million Verified Companies [Dataset]. https://datarade.ai/data-products/list-of-1m-companies-in-vietnam-bolddata
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Apr 21, 2021
    Dataset authored and provided by
    CompanyData.com (BoldData)
    Area covered
    Vietnam
    Description

    CompanyData.com, powered by BoldData, provides verified company information sourced directly from official trade registers. Our Vietnam database features 1,828,945 company records, offering a reliable and up-to-date foundation for your business needs.

    Each Vietnamese company profile includes detailed firmographic data such as company name, registration number, legal form, industry classification, revenue, and employee count. Many records also contain contact details like emails and mobile numbers of decision-makers, helping you connect directly with the right businesses.

    Our Vietnam data is trusted for a wide range of applications including compliance, KYC verification, lead generation, market research, sales and marketing campaigns, CRM enrichment, and AI training. Every record is curated for accuracy and relevance, ensuring your strategies are built on reliable information.

    Choose the delivery method that suits your business best. We offer tailored company lists, complete national databases, real-time API access, and ready-to-use Excel or CSV files. Our enrichment services further enhance your existing data with fresh, verified information.

    With access to more than 380 million verified companies worldwide, CompanyData.com helps businesses grow locally in Vietnam and scale globally with confidence. Let us power your data-driven decisions with precision, quality, and reach.

  6. TMS daily traffic counts CSV

    • hub.arcgis.com
    Updated Aug 30, 2020
    Cite
    Waka Kotahi (2020). TMS daily traffic counts CSV [Dataset]. https://hub.arcgis.com/datasets/9cb86b342f2d4f228067a7437a7f7313
    Explore at:
    Dataset updated
    Aug 30, 2020
    Dataset authored and provided by
    Waka Kotahi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    You can also access an API version of this dataset: the TMS (traffic monitoring system) daily-updated traffic counts API.

    Important note: due to the size of this dataset, you won't be able to open it fully in Excel. Use Notepad / R / any software package which can open more than a million rows.
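    For instance, a minimal pandas sketch that streams the export in chunks ("tms_daily_traffic_counts.csv" is a placeholder for whatever file name you download):

      import pandas as pd

      # Count rows without ever holding the full >1M-row file in memory.
      rows = 0
      for chunk in pd.read_csv("tms_daily_traffic_counts.csv", chunksize=250_000):
          rows += len(chunk)
      print(f"{rows} rows read")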

    Data reuse caveats: as per license.

    Data quality statement: please read the accompanying user manual, explaining:

    • how this data is collected
    • identification of count stations
    • traffic monitoring technology
    • monitoring hierarchy and conventions
    • typical survey specification
    • data calculation
    • TMS operation

    Traffic monitoring for state highways: user manual [PDF 465 KB]

    The data is at daily granularity. However, the actual update frequency of the data depends on the contract the site falls within. For telemetry sites it's once a week on a Wednesday. Some regional sites are fortnightly, and some monthly or quarterly. Some are only 4 weeks a year, with timing depending on contractors’ programme of work.

    Data quality caveats: you must use this data in conjunction with the user manual and the following caveats.

    • The road sensors used in data collection are subject to both technical errors and environmental interference.
    • Data is compiled from a variety of sources. Accuracy may vary and the data should only be used as a guide.
    • As not all road sections are monitored, a direct calculation of Vehicle Kilometres Travelled (VKT) for a region is not possible.
    • Data is sourced from Waka Kotahi New Zealand Transport Agency TMS data.
    • For sites that use dual loops, classification is by length. Vehicles with a length of less than 5.5m are classed as light vehicles. Vehicles over 11m long are classed as heavy vehicles. Vehicles between 5.5 and 11m are split 50:50 into light and heavy.
    • In September 2022, the National Telemetry contract was handed to a new contractor. During the handover process, due to some missing documents and aged technology, 40 of the 96 national telemetry traffic count sites went offline. The current contractor has continued to upload data from all active sites and has gradually worked to bring most offline sites back online. Please note and account for possible gaps in data from National Telemetry sites.

    The NZTA Vehicle Classification Relationships diagram below shows the length classification (typically dual loops) and axle classification (typically pneumatic tube counts), and how these map to the Monetised benefits and costs manual, table A37, page 254.

    Monetised benefits and costs manual [PDF 9 MB]

    For the full TMS classification schema see Appendix A of the traffic counting manual vehicle classification scheme (NZTA 2011), below.

    Traffic monitoring for state highways: user manual [PDF 465 KB]

    State highway traffic monitoring (map)

    State highway traffic monitoring sites

  7. PERMIT

    • data.cityofchicago.org
    csv, xlsx, xml
    Updated Jul 30, 2025
    Cite
    City of Chicago (2025). PERMIT [Dataset]. https://data.cityofchicago.org/Buildings/PERMIT/eq9q-jup4
    Explore at:
    Available download formats: csv, xml, xlsx
    Dataset updated
    Jul 30, 2025
    Authors
    City of Chicago
    Description

    Violations issued by the Department of Buildings in the City of Chicago from 2006 to the present. The dataset contains more than 1 million records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu, then open the file in an ASCII text editor, such as WordPad, to view and search.

    Violations are always associated with an inspection, and there can be multiple violation records for one (1) inspection record. Data fields requiring description are detailed below.

    VIOLATION DATE: The date the violation was cited.

    INSPECTION CATEGORY: Inspections are categorized by one of the following:

    • COMPLAINT – Inspection is a result of a 311 Complaint
    • PERIODIC – Inspection is a result of a recurring inspection (typically on an annual cycle)
    • PERMIT – Inspection is a result of a Permit
    • REGISTRATION – Inspection is a result of a Registration (typically Vacant Building Registration)

    PROPERTY GROUP: Properties (lots) in the City of Chicago can typically have multiple point addresses, range addresses, and buildings. Examples are corner lots, large lots, and lots with front and rear buildings. As a result, inspections (and their associated violations), permits, and complaints related to a single property could have different addresses. This problem can be reconciled by using Property Group: all point and range addresses for a property are assigned the same Property Group key.

    Data Owner: Buildings
    Time Period: January 1, 2006 to present
    Frequency: Data is updated daily
    Related Applications: Building Data Warehouse http://www.cityofchicago.org/city/en/depts/bldgs/provdrs/inspect/svcs/building_violationsonline.html
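    As a hedged sketch of working with the export outside Excel (the file name is a placeholder, and the column name is assumed to match the field description above):

      import pandas as pd

      # Stream the >1M-row CSV in chunks, keeping only PERMIT-category
      # inspections.
      parts = []
      for chunk in pd.read_csv("violations.csv", chunksize=200_000):
          parts.append(chunk[chunk["INSPECTION CATEGORY"] == "PERMIT"])
      permits = pd.concat(parts, ignore_index=True)
      print(len(permits), "permit-related violation records")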

  8. 88.6 Million Developer Comments from GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 4, 2024
    Cite
    Benjamin S. Meyers (2024). 88.6 Million Developer Comments from GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5596536
    Explore at:
    Dataset updated
    Jan 4, 2024
    Dataset provided by
    Andrew Meneely
    Benjamin S. Meyers
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:

    54,252,380 issue comments (from 13,458,208 issues)

    979,642 commit comments (from 49,710,108 commits)

    33,408,215 pull request comments (from 12,680,373 pull requests)

    Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.

    Purpose

    The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.

    Collection Process

    Full implementation details can be found in the following publication:

    Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.

    Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:

    At least 850 stars.

    Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.

    Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecoderErrors, etc.). Eight target languages had no repositories matching the above criteria.

    After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files as no newlines have been removed from developer comments.

    88_million_developer_comments.zip

    This zip file contains 135 CSV files; 3 per language. CSV names are formatted <language>_<type>.csv, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests).

    Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL.

    Details on the columns in each CSV file are described in the provided README.md.
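    A minimal sketch of the recommended csv.reader approach (the file name is one of the per-language CSVs; see README.md for the actual column layout):

      import csv

      # newline="" lets csv.reader handle the newlines embedded in developer
      # comments instead of splitting records on them.
      with open("Python_is.csv", newline="", encoding="utf-8") as f:
          reader = csv.reader(f)
          header = next(reader)
          n = sum(1 for _ in reader)
      print(f"{n} issue comments")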

    Detailed_Breakdown.ods

    This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.

    Note On Completeness

    We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, sometimes a query failed and that data is not included here.

    Versioning

    v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.

    v1.0: Original corpus.

    Contact

    Please contact Benjamin S. Meyers (email) with questions about this data and its collection.

    Acknowledgments

    Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).

    This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology. doi:10.34788/0S3G-QD15

  9. BBC News Dataset – February 2023 Edition

    • crawlfeeds.com
    csv, zip
    Updated Jun 14, 2025
    Cite
    Crawl Feeds (2025). BBC News Dataset – February 2023 Edition [Dataset]. https://crawlfeeds.com/datasets/bbc-news-dataset-feb-2023
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Jun 14, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Get access to a comprehensive and structured dataset of BBC News articles, freshly crawled and compiled in February 2023. This collection includes 1 million records from one of the world’s most trusted news organizations — perfect for training NLP models, sentiment analysis, and trend detection across global topics.

    💾 Format: CSV (available in ZIP archive)

    📢 Status: Published and available for immediate access

    Use Cases

    • Train language models to summarize or categorize news

    • Detect media bias and compare narrative framing

    • Conduct research in journalism, politics, and public sentiment

    • Enrich news aggregation platforms with clean metadata

    • Analyze content distribution across categories (e.g. health, politics, tech)

    This dataset ensures reliable and high-quality information sourced from a globally respected outlet. The format is optimized for quick ingestion into your pipelines — with clean text, timestamps, image links, and more.

    Need a filtered dataset or want this refreshed for a later date? We offer on-demand news scraping as well.

    👉 Request access or sample now

  10. FOI-01943 - Datasets - Open Data Portal

    • opendata.nhsbsa.net
    Updated Jun 12, 2024
    Cite
    (2024). FOI-01943 - Datasets - Open Data Portal [Dataset]. https://opendata.nhsbsa.net/dataset/foi-01943
    Explore at:
    Dataset updated
    Jun 12, 2024
    Description

    https://opendata.nhsbsa.net/dataset/foi-01502
    September 2023: https://opendata.nhsbsa.net/dataset/foi-01550
    October 2023: https://opendata.nhsbsa.net/dataset/foi-01668
    November 2023: https://opendata.nhsbsa.net/dataset/foi-01669
    December 2023: https://opendata.nhsbsa.net/dataset/foi-01756

    Some data sets are over 1 million rows of data, so you may need to use add-ons already existing in Microsoft Excel to view a data set in its entirety. The Microsoft PowerPivot add-on for Excel can be used to handle larger data sets. It is available using the link in the 'Related Links' section below:

    https://www.microsoft.com/en-us/download/details.aspx?id=43348

    Once PowerPivot has been installed, to load the large files, please follow the instructions below:

    1. Start Excel as normal
    2. Click on the PowerPivot tab
    3. Click on the PowerPivot Window icon (top left)
    4. In the PowerPivot Window, click on the "From Other Sources" icon
    5. In the Table Import Wizard, e.g., scroll to the bottom and select Text File
    6. Browse to the file you want to open and choose the file extension you require, e.g., CSV
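    As an alternative to the PowerPivot route, a hedged Python sketch (the file name is a placeholder for one of the downloaded FOI extracts) loads the CSV into SQLite in chunks so it can be queried without Excel's row limit:

      import sqlite3
      import pandas as pd

      con = sqlite3.connect("foi.db")
      for chunk in pd.read_csv("foi-01756.csv", chunksize=100_000):
          chunk.to_sql("foi_data", con, if_exists="append", index=False)
      print(con.execute("SELECT COUNT(*) FROM foi_data").fetchone()[0], "rows")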

  11. BF skip indexes for Ethereum

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 26, 2024
    Cite
    Loporchio, Matteo (2024). BF skip indexes for Ethereum [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957140
    Explore at:
    Dataset updated
    Dec 26, 2024
    Dataset authored and provided by
    Loporchio, Matteo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General information

    This repository includes all data needed to reproduce the experiments presented in [1]. The paper describes the BF skip index, a data structure based on Bloom filters [2] that can be used for answering inter-block queries on blockchains efficiently. The article also includes a historical analysis of the logsBloom filters included in the Ethereum block headers, as well as an experimental analysis of the proposed data structure. The latter was conducted using the data set of events generated by the CryptoKitties Core contract, a popular decentralized application launched in 2017 (and also one of the first applications based on NFTs).

    In this description, we use the following abbreviations (also adopted throughout the paper) to denote two different sets of Ethereum blocks.

    D1: set of all Ethereum blocks between height 0 and 14999999.

    D2: set of all Ethereum blocks between height 14000000 and 14999999.

    Moreover, in accordance with the terminology adopted in the paper, we define the set of keys of a block as the set of all contract addresses and log topics of the transactions in the block. As defined in [3], log topics comprise event signature digests and the indexed parameters associated with the event occurrence.

    Data set description

    • filters_ones_0-14999999.csv.xz – Compressed CSV file containing the number of ones for each logsBloom filter in D1.

    • receipt_stats_0-14999999.csv.xz – Compressed CSV file containing statistics about all transaction receipts in D1.

    • Approval.csv – CSV file containing the Approval event occurrences for the CryptoKitties Core contract in D2.

    • Birth.csv – CSV file containing the Birth event occurrences for the CryptoKitties Core contract in D2.

    • Pregnant.csv – CSV file containing the Pregnant event occurrences for the CryptoKitties Core contract in D2.

    • Transfer.csv – CSV file containing the Transfer event occurrences for the CryptoKitties Core contract in D2.

    • events.xz – Compressed binary file containing information about all contract events in D2.

    • keys.xz – Compressed binary file containing information about all keys in D2.

    File structure

    We now describe the structure of the files included in this repository.

    filters_ones_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 3 columns. Note that it is not necessary to decompress this file, as the provided code is capable of processing it directly in its compressed form. The columns have the following meaning.

    blockId: the identifier of the block.

    timestamp: timestamp of the block.

    numOnes: number of bits set to 1 in the logsBloom filter of the block.

    receipt_stats_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 5 columns. As for the previous file, it is not necessary to decompress this file.

    blockId: the identifier of the block.

    txCount: number of transactions included in the block.

    numLogs: number of event logs included in the block.

    numKeys: number of keys included in the block.

    numUniqueKeys: number of distinct keys in the block (useful as the same key may appear multiple times).

    All CSV files related to the CryptoKitties Core events (i.e., Approval.csv, Birth.csv, Pregnant.csv, Transfer.csv) have the same structure. They consist of 1 million rows (one for each block in D2) and 2 columns, namely:

    blockId: identifier of the block.

    numOcc: number of event occurrences in the block.

    events.xz is a compressed binary file describing all unique event occurrences in the blocks of D2. The file contains 1 million data chunks (i.e., one for each Ethereum block). Each chunk includes the following information. Do note that this file only records unique event occurrences in each block, meaning that if an event from a contract is triggered more than once within the same block, there will be only one sequence within the corresponding chunk.

    blockId: identifier of the block (4 bytes).

    numEvents: number of event occurrences in the block (4 bytes).

    A list of numEvents sequences, each made up of 52 bytes. A sequence represents an event occurrence and is the concatenation of two fields, namely:

    Address of the contract triggering the event (20 bytes).

    Event signature digest (32 bytes).

    keys.xz is a compressed binary file describing all unique keys in the blocks of D2. As for the previous file, duplicate keys only appear once. The file contains 1 million data chunks, each representing an Ethereum block and including the following information.

    blockId: identifier of the block (4 bytes)

    numAddr: number of unique contract addresses (4 bytes).

    numTopics: number of unique topics (4 bytes).

    A sequence of numAddr addresses, each represented using 20 bytes.

    A sequence of numTopics topics, each represented using 32 bytes.
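    A minimal parsing sketch for keys.xz follows; the byte order of the integer fields is not stated above, so big-endian is an assumption to verify against the provided code:

      import lzma
      import struct

      with lzma.open("keys.xz", "rb") as f:
          while True:
              header = f.read(12)
              if len(header) < 12:
                  break  # end of file
              block_id, num_addr, num_topics = struct.unpack(">III", header)
              addresses = [f.read(20) for _ in range(num_addr)]  # 20-byte addresses
              topics = [f.read(32) for _ in range(num_topics)]   # 32-byte topics
              # ... process this block's keys here ...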

    Notes

    For space reasons, some of the files in this repository have been compressed using the XZ compression utility. Unless otherwise specified, these files need to be decompressed before they can be read. Please make sure you have an application installed on your system that is capable of decompressing such files.

    Cite this work

    If the data included in this repository have been useful, please cite the following article in your work.

    @article{loporchio2025skip,
      title={Skip index: Supporting efficient inter-block queries and query authentication on the blockchain},
      author={Loporchio, Matteo and Bernasconi, Anna and Di Francesco Maesa, Damiano and Ricci, Laura},
      journal={Future Generation Computer Systems},
      volume={164},
      pages={107556},
      year={2025},
      publisher={Elsevier}
    }

    References

    [1] Loporchio, Matteo et al. "Skip index: supporting efficient inter-block queries and query authentication on the blockchain." Future Generation Computer Systems 164 (2025): 107556. https://doi.org/10.1016/j.future.2024.107556

    [2] Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13.7 (1970): 422-426.

    [3] Wood, Gavin. "Ethereum: A secure decentralised generalised transaction ledger." Ethereum project yellow paper 151.2014 (2014): 1-32.

  12. Stop Work Orders Plans Needed

    • data.cityofchicago.org
    csv, xlsx, xml
    Updated Jul 31, 2025
    Cite
    City of Chicago (2025). Stop Work Orders Plans Needed [Dataset]. https://data.cityofchicago.org/Buildings/Stop-Work-Orders-Plans-Needed/mpys-3gzj
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Jul 31, 2025
    Authors
    City of Chicago
    Description

    Violations issued by the Department of Buildings in the City of Chicago from 2006 to the present. The dataset contains more than 1 million records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu, then open the file in an ASCII text editor, such as WordPad, to view and search.

    Violations are always associated with an inspection, and there can be multiple violation records for one (1) inspection record. Data fields requiring description are detailed below.

    VIOLATION DATE: The date the violation was cited.

    INSPECTION CATEGORY: Inspections are categorized by one of the following:

    • COMPLAINT – Inspection is a result of a 311 Complaint
    • PERIODIC – Inspection is a result of a recurring inspection (typically on an annual cycle)
    • PERMIT – Inspection is a result of a Permit
    • REGISTRATION – Inspection is a result of a Registration (typically Vacant Building Registration)

    PROPERTY GROUP: Properties (lots) in the City of Chicago can typically have multiple point addresses, range addresses, and buildings. Examples are corner lots, large lots, and lots with front and rear buildings. As a result, inspections (and their associated violations), permits, and complaints related to a single property could have different addresses. This problem can be reconciled by using Property Group: all point and range addresses for a property are assigned the same Property Group key.

    Data Owner: Buildings
    Time Period: January 1, 2006 to present
    Frequency: Data is updated daily
    Related Applications: Building Data Warehouse http://www.cityofchicago.org/city/en/depts/bldgs/provdrs/inspect/svcs/building_violationsonline.html

  13. 🚴🗃️ BCN Bike Sharing Dataset - Bicing Stations

    • kaggle.com
    Updated Apr 21, 2024
    Cite
    Enric Domingo (2024). 🚴🗃️ BCN Bike Sharing Dataset - Bicing Stations [Dataset]. https://www.kaggle.com/datasets/edomingo/bicing-stations-dataset-bcn-bike-sharing/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Enric Domingo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 250 million rows of information from the ~500 bike stations of the Barcelona public bicycle sharing service. The data consists of time series information on the electric and mechanical bicycles available, sampled approximately every 4 minutes, from March 2019 to March 2024 (the latest available csv file, with the idea of being updated with every new month's file). This data could inspire many different use cases, from geographical data analysis to hierarchical ML time series models or Graph Neural Networks, among others. Feel free to create a New Notebook from this page to use it and share your ideas with everyone!

    [Image: map of Bicing stations, 2024 (stations_map_2024.png)]

    Each month's information is in a separate file named {year}_{month}_STATIONS.csv. The metadata for every station has been simplified and compressed into the {year}_INFO.csv files, which contain a single entry for every station and day, with a separate file for every year.

    The original data has various errors. A few of them have already been corrected, but there are still some missing values, columns with wrong data types, and other minor artifacts or gaps. From time to time I may manually correct more of these.

    The data is collected from the public BCN Open Data website, which is available to everyone (some resources require creating a free account and token):

    • Stations data: https://opendata-ajuntament.barcelona.cat/data/en/dataset/estat-estacions-bicing
    • Stations info: https://opendata-ajuntament.barcelona.cat/data/en/dataset/informacio-estacions-bicing

    You can find more information in them.

    Please, consider upvoting this dataset if you find it interesting! 🤗

    Some observations:
    The historical data for June '19 does not have data for the 20th between 7:40 am and 2:00 pm.
    The historical data for July '19 does not have data from the 26th at 1:30 pm until the 29th at 10:40 am.
    The historical data for November '19 may not have some data from 10:00 pm on the 26th to 11:00 am on the 27th.
    The historical data for August '20 does not have data from the 7th at 2:25 am until the 10th at 10:40 am.
    The historical data for November '20 does not have data on the following days/times: the 4th from 1:45 am to 11:05 am; the 20th from 7:50 pm to the 21st at 10:50 am; and the 27th from 2:50 am to the 30th at 9:50 am.
    The historical data for August '23 does not have data from the 22nd to the 31st due to a technical incident.
    The historical data for September '23 does not have data from the 1st to the 5th due to a technical incident.
    The historical data for February '24 does not have data on the 5th between 12:50 pm and 1:05 pm.
    Others: due to COVID-19 measures, the Bicing service was temporarily stopped, which is reflected in the historical data.

    Field Description:

    Array of data for each station:

    station_id: Identifier of the station
    num_bikes_available: Number of available bikes
    num_bikes_available_types: Array of types of available bikes
    mechanical: Number of available mechanical bikes
    ebike: Number of available electric bikes
    num_docks_available: Number of available docks
    is_installed: The station is properly installed (0-NO,1-YES)
    is_renting: The station is providing bikes correctly
    is_returning: The station is docking bikes correctly
    last_reported: Timestamp of the station information
    is_charging_station: The station has electric bike charging capacity
    status: Status of the station (IN_SERVICE=In service, CLOSED=Closed)
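    As a hedged loading sketch using the fields above (it assumes the bike-type counts are flattened into columns in the monthly CSVs and that last_reported is a Unix timestamp in seconds; verify both against the data):

      import pandas as pd

      # File name follows the {year}_{month}_STATIONS.csv convention above.
      df = pd.read_csv("2024_03_STATIONS.csv")
      df["last_reported"] = pd.to_datetime(df["last_reported"], unit="s")

      # Mean mechanical and electric bikes available per station for the month.
      avail = df.groupby("station_id")[["mechanical", "ebike"]].mean()
      print(avail.head())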

  14. 4408 S Drexel et al

    • data.cityofchicago.org
    application/rdfxml +5
    Updated Jul 29, 2025
    Cite
    City of Chicago (2025). 4408 S Drexel et al [Dataset]. https://data.cityofchicago.org/Buildings/4408-S-Drexel-et-al/u3vp-ihxr
    Explore at:
    Available download formats: tsv, application/rdfxml, csv, application/rssxml, json, xml
    Dataset updated
    Jul 29, 2025
    Authors
    City of Chicago
    Area covered
    South Drexel Boulevard
    Description

    Violations issued by the Department of Buildings in the City of Chicago from 2006 to the present. The dataset contains more than 1 million records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu, then open the file in an ASCII text editor, such as WordPad, to view and search.

    Violations are always associated with an inspection, and there can be multiple violation records for one (1) inspection record. Data fields requiring description are detailed below.

    VIOLATION DATE: The date the violation was cited.

    INSPECTION CATEGORY: Inspections are categorized by one of the following:

    • COMPLAINT – Inspection is a result of a 311 Complaint
    • PERIODIC – Inspection is a result of a recurring inspection (typically on an annual cycle)
    • PERMIT – Inspection is a result of a Permit
    • REGISTRATION – Inspection is a result of a Registration (typically Vacant Building Registration)

    PROPERTY GROUP: Properties (lots) in the City of Chicago can typically have multiple point addresses, range addresses, and buildings. Examples are corner lots, large lots, and lots with front and rear buildings. As a result, inspections (and their associated violations), permits, and complaints related to a single property could have different addresses. This problem can be reconciled by using Property Group: all point and range addresses for a property are assigned the same Property Group key.

    Data Owner: Buildings
    Time Period: January 1, 2006 to present
    Frequency: Data is updated daily
    Related Applications: Building Data Warehouse http://www.cityofchicago.org/city/en/depts/bldgs/provdrs/inspect/svcs/building_violationsonline.html

  15. CompanyData.com (BoldData) — USA Largest B2B Company Database — 69.9+...

    • datarade.ai
    Updated Apr 20, 2021
    Cite
    CompanyData.com (BoldData) (2021). CompanyData.com (BoldData) — USA Largest B2B Company Database — 69.9+ Million Verified Companies [Dataset]. https://datarade.ai/data-products/list-of-55m-companies-in-united-states-of-america-bolddata
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Apr 20, 2021
    Dataset authored and provided by
    CompanyData.com (BoldData)
    Area covered
    United States
    Description

    CompanyData.com powered by BoldData is your gateway to verified, high-quality business data from around the world. We specialize in delivering structured company information sourced directly from official trade registers, giving you reliable data to fuel smarter business decisions.

    Our USA company database includes over 69,853,300 verified business records, making it one of the most comprehensive sources of company information available. Each record contains detailed firmographics such as industry classification, company size and revenue, corporate hierarchies and verified contact details including decision-maker names, email addresses, direct dials and mobile numbers.

    This rich dataset supports a wide range of use cases, including:

    • Regulatory compliance and KYC verification
    • Sales prospecting and lead generation
    • B2B marketing and audience segmentation
    • CRM enrichment and data cleansing
    • Training data for AI and machine learning models

    We offer flexible delivery options tailored to your workflow:

    • Tailored company lists filtered by location, size, industry and more
    • Full USA company database exports in Excel or CSV
    • Real-time API access for seamless data integration
    • Data enrichment services to enhance your internal records

    The United States is a key part of our global database of more than 380 million verified companies across more than 200 countries. Whether you are expanding into the US market or enriching global CRM systems, we deliver the accuracy, scale and flexibility your business demands.

    Partner with CompanyData.com to unlock actionable company intelligence in the USA delivered how you need it, when you need it, with the precision your business deserves.

  16. Dataplex: FDA Medical Device Data | 24M+ Rows of Key Device Product Data for...

    • datarade.ai
    .csv
    Updated Aug 12, 2024
    Cite
    Dataplex (2024). Dataplex: FDA Medical Device Data | 24M+ Rows of Key Device Product Data for Research & Analysis [Dataset]. https://datarade.ai/data-products/dataplex-fda-medical-device-data-24m-rows-of-key-device-i-dataplex
    Explore at:
    Available download formats: .csv
    Dataset updated
    Aug 12, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    United States of America
    Description

    The FDA Device Dataset by Dataplex provides comprehensive access to over 24 million rows of detailed information, covering 9 key data types essential for anyone involved in the medical device industry. Sourced directly from the U.S. Food and Drug Administration (FDA), this dataset is a critical resource for regulatory compliance, market analysis, and product safety assessment.

    Dataset Overview:

    This dataset includes data on medical device registrations, approvals, recalls, and adverse events, among other crucial aspects. The dataset is meticulously cleaned and structured to ensure that it meets the needs of researchers, regulatory professionals, and market analysts.

    24 Million Rows of Data:

    With over 24 million rows, this dataset offers an extensive view of the regulatory landscape for medical devices. It includes data types such as classification, event, enforcement, 510k, registration listings, recall, PMA, UDI, and covid19 serology. This wide range of data types allows users to perform granular analysis on a broad spectrum of device-related topics.

    Sourced from the FDA:

    All data in this dataset is sourced directly from the FDA, ensuring that it is accurate, up-to-date, and reliable. Regular updates ensure that the dataset remains current, reflecting the latest in device approvals, clearances, and safety reports.

    Key Features:

    • Comprehensive Coverage: Includes 9 key device data types, such as 510(k) clearances, premarket approvals, device classifications, and adverse event reports.

    • Regulatory Compliance: Provides detailed information necessary for tracking compliance with FDA regulations, including device recalls and enforcement actions.

    • Market Analysis: Analysts can utilize the dataset to assess market trends, monitor competitor activities, and track the introduction of new devices.

    • Product Safety Analysis: Researchers can analyze adverse event reports and device recalls to evaluate the safety and performance of medical devices.

    Use Cases:

    • Regulatory Compliance: Ensure your devices meet FDA standards, monitor compliance trends, and stay informed about regulatory changes.

    • Market Research: Identify trends in the medical device market, track new device approvals, and analyze competitive landscapes with up-to-date and historical data.

    • Product Safety: Assess the safety and performance of medical devices by examining detailed adverse event reports and recall data.

    Data Quality and Reliability:

    The FDA Device Dataset prioritizes data quality and reliability. Each record is meticulously sourced from the FDA's official databases, ensuring that the information is both accurate and up-to-date. This makes the dataset a trusted resource for critical applications, where data accuracy is vital.

    Integration and Usability:

    The dataset is provided in CSV format, making it compatible with most data analysis tools and platforms. Users can easily import, analyze, and utilize the data for various applications, from regulatory reporting to market analysis.
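
    As a rough illustration of that workflow, the sketch below pulls a recall-focused slice with pandas; the file name and column names (data_type, event_date_initiated) are assumptions for the sketch, and the real delivery may ship one CSV per data type rather than a single file:

    import pandas as pd

    # Hypothetical single-file delivery with a column marking the FDA data type.
    devices = pd.read_csv("fda_device_data.csv", low_memory=False)

    # Slice out recall records and count them by year for a quick safety-trend view.
    recalls = devices[devices["data_type"] == "recall"].copy()
    recalls["year"] = pd.to_datetime(recalls["event_date_initiated"], errors="coerce").dt.year

    print(recalls.groupby("year").size().tail(10))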

    User-Friendly Structure and Metadata:

    The data is organized for easy navigation, with clear metadata files included to help users identify relevant records. The dataset is structured by device type, approval and clearance processes, and adverse event reports, allowing for efficient data retrieval and analysis.

    Ideal For:

    • Regulatory Professionals: Monitor FDA compliance, track regulatory changes, and prepare for audits with comprehensive and up-to-date product data.

    • Market Analysts: Conduct detailed research on market trends, assess new device entries, and analyze competitive dynamics with extensive FDA data.

    • Healthcare Researchers: Evaluate the safety and efficacy of medical devices, identify potential risks, and contribute to improved patient outcomes through detailed analysis.

    This dataset is an indispensable resource for anyone involved in the medical device industry, providing the data and insights necessary to drive informed decisions and ensure compliance with FDA regulations.

  17. Data Citation Corpus Data File

    • zenodo.org
    • redivis.com
    zip
    Updated May 20, 2024
    + more versions
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.11216814
    Explore at:
    Available download formats: zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    DataCite (https://www.datacite.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Data file for the first release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 10,006,058 data citation records in JSON and CSV formats. The JSON file is the version of record.

    Version 1.0 of the corpus data file was released on January 30, 2024. Release v1.1 is an optimized version of v1.0 designed to make the original citation records more usable. No citations have been added to or removed from the dataset in v1.1.

    For convenience, the data file is provided in batches of approximately 1 million records each. The publication date and batch number are included in each component file name, e.g., 2024-05-10-data-citation-corpus-01-v1.1.json.

    The data citations in the file originate from DataCite Event Data and a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication object

    The data file includes the following fields:

    • id (required): Internal identifier for the citation
    • created (required): Date of the item's incorporation into the corpus
    • updated (required): Date of the item's most recent update in the corpus
    • repository (optional): Repository where the cited data is stored
    • publisher (optional): Publisher of the article citing the data
    • journal (optional): Journal of the article citing the data
    • title (optional): Title of the cited data
    • objId (required): DOI of the article where the data is cited
    • subjId (required): DOI or accession number of the cited data
    • publishedDate (optional): Date when the citing article was published
    • accessionNumber (optional): Accession number of the cited data
    • doi (optional): DOI of the cited data
    • relationTypeId (optional): Relation type in metadata between citation object and subject
    • source (required): Source where the citation was harvested
    • subjects (optional): Subject information for the cited data
    • affiliations (optional): Affiliation information for the creator of the cited data
    • funders (optional): Funding information for the cited data
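
    A minimal sketch of exploring one downloaded batch, assuming each JSON batch file is a top-level array of citation records carrying the fields listed above (the exact container layout should be verified against the file itself):

    import json
    from collections import Counter

    # One batch of roughly 1 million records; names follow the pattern described above.
    with open("2024-05-10-data-citation-corpus-01-v1.1.json", encoding="utf-8") as f:
        records = json.load(f)  # assumed: a top-level JSON array of records

    # Tally harvesting sources and the most frequently cited repositories.
    sources = Counter(r["source"] for r in records)
    repositories = Counter(r["repository"] for r in records if r.get("repository"))

    print(sources)
    print(repositories.most_common(10))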

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

    Feedback on the data file can be submitted via Github. For general questions, email info@makedatacount.org.

  18. Fox News dataset is for analyzing media trends and narratives

    • crawlfeeds.com
    csv, zip
    Updated May 19, 2025
    Cite
    Crawl Feeds (2025). Fox News dataset is for analyzing media trends and narratives [Dataset]. https://crawlfeeds.com/datasets/fox-news-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.

    Key Features of the Fox News Dataset

    • Extensive Coverage: Contains more than 1 million articles spanning various topics and events up to 2023.
    • Research-Ready: Perfect for text classification, natural language processing (NLP), and other research purposes.
    • Format: Provided in CSV format for seamless integration into analytical and research tools.

    Why Use This Dataset?

    This large dataset is ideal for:

    • Text Classification: Develop machine learning models to classify and categorize news content.
    • Natural Language Processing (NLP): Conduct sentiment analysis, keyword extraction, or topic modeling.
    • Media and Political Research: Analyze media narratives, public opinion, and political trends reflected in Fox News articles.
    • Trend Analysis: Identify shifts in public discourse and media focus over time.
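
    As a rough starting point for the text-classification and NLP use cases above, a baseline sketch with scikit-learn follows; the file name and column names (text, category) are assumptions to verify against the downloaded CSV:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Column names are assumptions; check them against the actual file.
    articles = pd.read_csv("fox_news_articles.csv").dropna(subset=["text", "category"])

    X_train, X_test, y_train, y_test = train_test_split(
        articles["text"], articles["category"], test_size=0.2, random_state=0
    )

    # TF-IDF features plus a linear classifier: a standard text-classification baseline.
    model = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")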

    Explore More News Datasets

    Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.

    The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.

  19. MNAD : Moroccan News Articles Dataset

    • kaggle.com
    Updated Jan 16, 2022
    + more versions
    Cite
    JM100 (2022). MNAD : Moroccan News Articles Dataset [Dataset]. https://www.kaggle.com/jmourad100/mnad-moroccan-news-articles-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JM100
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The MNAD corpus is a collection of over 1 million Moroccan news articles written in the modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

    Dataset Fields

    • Title: The title of the article
    • Body: The body of the article
    • Category: The category of the article
    • Source: The electronic newspaper source of the article

    About Version 1 of the Dataset (MNAD.v1)

    Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.

    The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization," presented at the 2021 International Conference on Decision Aid Sciences and Application (DASA).

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv1
    • Huggingface Datasets: MNADv1

    About Version 2 of the Dataset (MNAD.v2)

    Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.

    The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.

    Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
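
    A minimal pandas sketch of that cleaning pipeline, assuming a raw frame with the Title, Body, Category, and Source fields described above (the raw file name and the word-count bounds for "very short" and "very long" are illustrative assumptions):

    import pandas as pd

    df = pd.read_csv("MNADv2_raw.csv")  # hypothetical raw file name

    # Remove duplicates and rows with NaN values.
    df = df.drop_duplicates().dropna(subset=["Title", "Body", "Category", "Source"])

    # Replace new lines with spaces and collapse multiple spaces.
    df["Body"] = df["Body"].str.replace(r"\s+", " ", regex=True).str.strip()

    # Exclude very short and very long articles (bounds are illustrative).
    word_counts = df["Body"].str.split().str.len()
    df = df[word_counts.between(30, 5000)]

    # Drop articles that are not predominantly written in Arabic script.
    arabic_ratio = df["Body"].str.count(r"[\u0600-\u06FF]") / df["Body"].str.len()
    df = df[arabic_ratio > 0.5]

    df.to_csv("MNADv2.csv", index=False)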

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv2
    • Huggingface Datasets: MNADv2

    Citation

    If you use our data, please cite the following paper:

    @inproceedings{MNAD2021,
      author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
      title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
      year      = {2021},
      publisher = {{IEEE}},
      booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
      doi       = {10.1109/dasa53625.2021.9682402},
      url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
    }
    
  20. CompanyData.com (BoldData) — Greece Largest B2B Company Database — 1.07+ Million Verified Companies

    • datarade.ai
    Cite
    CompanyData.com (BoldData), CompanyData.com (BoldData) — Greece Largest B2B Company Database — 1.07+ Million Verified Companies [Dataset]. https://datarade.ai/data-products/list-of-1m-companies-in-greece-bolddata
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset authored and provided by
    CompanyData.com (BoldData)
    Area covered
    Greece
    Description

    Verified company data in Greece from CompanyData.com (BoldData)

    CompanyData.com (BoldData) delivers accurate, up-to-date company information sourced directly from official Greek trade registers and reliable local authorities. Whether you're entering the Greek market or enhancing existing operations, our verified data helps you identify, understand, and connect with the right businesses.

    Over 1,073,791 verified company records in Greece

    Our Greek business database covers more than 1.07 million companies, from family-owned enterprises in Thessaloniki to major corporations based in Athens. Every record is verified and updated regularly to ensure data integrity, regulatory compliance, and maximum relevance.

    Key data fields include:

    • Company names, registration numbers, and legal structures
    • Contact information including emails, phone numbers, and mobile numbers
    • Executive names and key decision-makers
    • Industry classification, turnover, number of employees, and founding year
    • Business status, operational activity, and hierarchy insights

    Use Greece company data for a wide range of applications

    Our verified data supports many essential business functions:

    • KYC and anti-money laundering compliance
    • B2B sales prospecting and marketing campaigns
    • CRM and database enrichment
    • Market research and strategic analysis
    • Artificial intelligence and machine learning training

    Flexible delivery tailored to your needs

    Whether you need a one-time list or a continuous data stream, we offer:

    • Custom-built B2B lists tailored to your target audience
    • Full datasets of all active Greek companies
    • Real-time API access for seamless integration
    • Delivery in Excel, CSV, or through our easy-to-use platform

    Part of a global business data network

    Our Greece dataset is part of our worldwide coverage of verified company records across more than 200 countries. With years of experience and a strong commitment to quality, CompanyData.com (BoldData) helps organizations worldwide find the right business data to grow, scale, and stay compliant.

    Access trusted company data from Greece and beyond with CompanyData.com.

The Canada Trademarks Dataset

6 scholarly articles cite this dataset (View in Google Scholar)
Description


The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

  • /csv: contains the .csv versions of the data files
  • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
  • /dta: contains the .dta versions of the data files
  • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
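
A minimal sketch of that construction sequence, run from the repository root after expanding the /py archive (the only assumption beyond the text above is that the iterparse scripts can be discovered by their filename prefix):

  import glob
  import subprocess
  import sys

  # Step 1: download CIPO's historical XML archives via SFTP. The script
  # interactively prompts for IP Horizons credentials and a target directory.
  subprocess.run([sys.executable, "py/sftp_secure.py"], check=True)

  # Step 2: run each iterparse script; each emits one CSV data file into the
  # target directory's /csv subfolder.
  for script in sorted(glob.glob("py/iterparse*.py")):
      subprocess.run([sys.executable, script], check=True)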

With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
