97 datasets found
  1. Document Formatting Service Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 10, 2025
    Cite
    Market Report Analytics (2025). Document Formatting Service Report [Dataset]. https://www.marketreportanalytics.com/reports/document-formatting-service-75560
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global document formatting services market is experiencing robust growth, driven by the increasing demand for professionally formatted documents across various sectors. The market's expansion is fueled by several key factors. Firstly, the proliferation of digital documents and the need for consistent branding and professional presentation across all communication channels are driving demand. Secondly, the rising complexity of document creation, particularly in fields like legal and finance, necessitates specialized formatting expertise. Businesses are increasingly outsourcing this function to focus on core competencies, leading to significant market expansion. The academic sector also contributes substantially, with students and researchers requiring formatting assistance for theses, dissertations, and research papers. While specific market size figures aren't provided, considering the growth in related sectors like digital publishing and freelance editing, a reasonable estimation for the 2025 market size could be around $2.5 billion, growing at a conservative Compound Annual Growth Rate (CAGR) of 10% over the forecast period (2025-2033). This growth is largely segmented across different application areas, with the business and legal sectors showing particularly strong demand. The service itself is divided across document types, with Word documents, PowerPoint presentations, and Excel spreadsheets representing the largest shares. North America and Europe currently hold the largest market shares, but growth potential is high in the Asia-Pacific region, driven by burgeoning economies and increased digital adoption.

    Despite its growth trajectory, the market faces some challenges. Competition amongst numerous providers, ranging from large outsourcing firms to individual freelancers, can lead to price pressure. The need for specialized expertise within specific document formatting standards (e.g., legal citations) requires continuous investment in training and upskilling. Moreover, concerns about data security and confidentiality within client documents are areas that providers must address effectively. The evolving technological landscape, with the potential introduction of more advanced automated formatting tools, also represents a long-term challenge. However, the ongoing demand for high-quality, error-free documentation suggests that human-driven expertise in document formatting will remain highly relevant and in demand for the foreseeable future.

  2. Argos Satellite Tracking Data for Thick-billed Murres (Uria lomvia) - Raw...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 10, 2024
    Cite
    U.S. Geological Survey (2024). Argos Satellite Tracking Data for Thick-billed Murres (Uria lomvia) - Raw Data [Dataset]. https://catalog.data.gov/dataset/argos-satellite-tracking-data-for-thick-billed-murres-uria-lomvia-raw-data
    Dataset updated
    Nov 10, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This metadata document describes the data contained in the "rawData" folder of this data package. This data package contains all data collected by the Argos System from 20 satellite transmitters attached to Thick-billed Murres on their breeding range in Arctic and western Alaska, 1995-1996. Five data files are included in the "rawData" folder of this data package. Two data files (with identical content) contain the raw Argos DIAG (Diagnostic) data, one in the legacy verbose ASCII format and one in a tabular Comma Separated Values (CSV) format. Two other data files (with identical content) contain the raw Argos DS (Dispose) data, one in the legacy verbose ASCII format and one in a tabular CSV format. The fifth file, "deploymentAttributes", contains one record for each transmitter deployment in a CSV-formatted table. The deployment attributes file contains information such as when the transmitter was attached to the animal, when tracking of a live animal ended, and a variety of variables describing the animal and transmitter. This table is identical to the "deploymentAttributes" table in the "processedData" folder of this data package.
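
    For a quick look, the CSV variants load directly with pandas. A minimal sketch (the file names here are illustrative; use the actual names in the "rawData" folder):

    import pandas as pd

    # Hypothetical file names -- substitute the actual names in the "rawData" folder.
    deployments = pd.read_csv('rawData/deploymentAttributes.csv')
    diag = pd.read_csv('rawData/diag.csv')

    # One record per transmitter deployment: attachment date, end of live tracking,
    # and descriptive variables for the animal and transmitter.
    print(deployments.head())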

  3. Data from: A large synthetic dataset for machine learning applications in...

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Available download formats: zip, png, csv, json
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability, and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
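
    Because every value is per-unit on a 100 MW base, converting a table to physical units is a single multiplication. A minimal sketch, using the file naming described in the next paragraph:

    import pandas as pd

    # 8736 hourly rows, one column per load bus; values are per-unit on a 100 MW base.
    loads_pu = pd.read_csv('loads_2016_1.csv')
    loads_mw = loads_pu * 100.0 # convert per-unit values to MW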

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
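
    To avoid mixing labels, a small helper can load the three matching tables for one synthetic year. A sketch, assuming the CSV files have been extracted from the yearly archives:

    import pandas as pd

    def load_year(year, index):
        # Load the loads, generators, and lines tables sharing the same label.
        label = f'{year}_{index}'
        return (pd.read_csv(f'loads_{label}.csv'),
                pd.read_csv(f'gens_{label}.csv'),
                pd.read_csv(f'lines_{label}.csv'))

    loads, gens, lines = load_year(2020, 1) # one coherent year of the dataset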

    Usage

    The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  4. Mobile Source Emissions Regulatory Compliance Data Inventory

    • catalog.data.gov
    • cloud.csiss.gmu.edu
    Updated Nov 30, 2020
    Cite
    U.S. EPA Office of Air and Radiation (OAR) - Office of Transportation and Air Quality (OTAQ) (2020). Mobile Source Emissions Regulatory Compliance Data Inventory [Dataset]. https://catalog.data.gov/dataset/mobile-source-emissions-regulatory-compliance-data-inventory
    Dataset updated
    Nov 30, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The Mobile Source Emissions Regulatory Compliance Data Inventory data asset contains measured summary compliance information on light-duty, heavy-duty, and non-road engine manufacturers by model, as well as fee payment data required by Title II of the 1990 Amendments to the Clean Air Act, to certify engines for sale in the U.S. and collect compliance certification fees. Data submitted by manufacturers falls into 12 industries: Heavy Duty Compression Ignition, Marine Spark Ignition, Heavy Duty Spark Ignition, Marine Compression Ignition, Snowmobile, Motorcycle & ATV, Non-Road Compression Ignition, Non-Road Small Spark Ignition, Light-Duty, Evaporative Components, Non-Road Large Spark Ignition, and Locomotive. Title II also requires the collection of fees from manufacturers submitting for compliance certification. Manufacturers submit data on an annual basis to document engine model changes for certification. Manufacturers also submit compliance information on already-certified in-use vehicles randomly selected by the EPA one year and four years into their life, to ensure that emissions systems continue to function appropriately over time.

    The EPA performs targeted confirmatory tests on approximately 15% of vehicles submitted for certification. Confirmatory data on engines is associated with its corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Section 209 of the 1990 Amendments to the Clean Air Act grants the State of California the authority to set its own standards and perform its own compliance certification through the California Air Resources Board (CARB). Currently, manufacturers submit compliance information separately to both the EPA and CARB, and data harmonization occurs between EPA data and CARB data only for Motorcycle & ATV submissions.

    Submitted data comes in XML format or as documents, with the majority of submissions being sent in XML. Data includes descriptive information on the engine itself, as well as on manufacturer testing methods and results. Submissions may include confidential business information (CBI), such as estimated sales, new technologies, catalysts and calibration, or other data elements indicated by the submitter as confidential. CBI data is not publicly available, but it is available within EPA under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Pollution emission data covers a range of Criteria Air Pollutants (CAPs), including carbon monoxide, hydrocarbons, nitrogen oxides, and particulate matter. Datasets are segmented by vehicle/engine model and year, with corresponding emission, test, and certification data.

    Data assets are primarily stored in EPA's Verify system. Data collected from the Heavy Duty Compression Ignition, Marine Spark Ignition, Heavy Duty Spark Ignition, Marine Compression Ignition, and Snowmobile industries, however, are currently stored in legacy systems that will be migrated to Verify in the future. Coverage began in 1979, with early records being primarily paper documents that did not go through the same level of validation as the digital submissions that began in 2005. Mobile Source Emissions Compliance documents with metadata, certificate, and summary decision information are made available to the public through EPA.gov via the OTAQ Document Index System (http://iaspub.epa.gov/otaqpub/).

  5. QUMPHY MIMIC IV Waveform Database PPG formatted

    • explore.openaire.eu
    Updated Sep 22, 2023
    Cite
    Nando Hegemann (2023). QUMPHY MIMIC IV Waveform Database PPG formatted [Dataset]. http://doi.org/10.5281/zenodo.8370570
    Dataset updated
    Sep 22, 2023
    Authors
    Nando Hegemann
    Description

    Derivative of the MIMIC IV Waveform Database, formatted to be suitable for machine learning.

    Formatting

    All records are split into intervals of roughly 60 seconds. The parameter values are averaged over each 60-second interval. The PPG signal data are unprocessed, i.e. as in the original dataset. Intervals with PPG signals containing missing data or large constant stretches are excluded. PPG signals and signal times are truncated so that all records have the same number of data points. Formatted data are split into 3 different file types, namely *_n.csv containing the averaged parameter values, *_s.npy containing PPG signal data, and t.npy containing the respective signal measurement times. Moreover, formatted data are split into trainXX_*, validation_*, and test_* data files, where the training data trainXX_* are split into multiple files for easier handling. This dataset was created using the following code: https://gitlab.com/qumphy/wp1-benchmark-data-conversion

    Funding

    The creation of this dataset has been supported by the European Partnership on Metrology programme 22HLT01 QUMPHY. This project (22HLT01 QUMPHY) has received funding from the EMPIR programme co-financed by the Participating States and from the European Union's Horizon 2020 research and innovation programme.
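
    A minimal loading sketch in Python, assuming file names that follow the conventions above (the validation names are illustrative):

    import numpy as np
    import pandas as pd

    params = pd.read_csv('validation_n.csv') # averaged parameter values per ~60 s interval
    signals = np.load('validation_s.npy') # unprocessed PPG signal data, one row per interval
    times = np.load('t.npy') # signal measurement times shared by all records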

  6. Large format supplies llc USA Import & Buyer Data

    • seair.co.in
    Updated Apr 2, 2019
    Cite
    Seair Exim (2019). Large format supplies llc USA Import & Buyer Data [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Apr 2, 2019
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  7. Table_segmentation Dataset

    • universe.roboflow.com
    zip
    Updated May 19, 2023
    Cite
    Shubhayan Sarkar (2023). Table_segmentation Dataset [Dataset]. https://universe.roboflow.com/shubhayan-sarkar-kwzea/table_segmentation/dataset/1
    Available download formats: zip
    Dataset updated
    May 19, 2023
    Dataset authored and provided by
    Shubhayan Sarkar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Lines Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Data Extraction from Complex Documents: The model could be used to segment and extract data from complex documents such as financial statements, invoices or reports. Its ability to identify lines and headers could help in parsing data accurately.

    2. Improvement of Accessibility Features: The model could be deployed in applications for visually impaired people, helping them understand text-based data represented in tables by recognizing and vocally relaying the content of each cell organized by lines and headers.

    3. Automating Data Conversion: The model could be used for automating conversion of printed tables into digital format. It can help in scanning books, research papers or old documents and convert tables in them into easily editable and searchable digital format.

    4. Intelligent Data Analysis Tools: It could be incorporated into a Data Analysis Software to pull out specific table data from a large number of documents, thus making the data analysis process more efficient.

    5. Aid in Educational Settings: The model can be used in educational tools to recognize and interpret table data for online learning systems, making studying more interactive and efficient, especially in subjects where tables are commonly used such as Statistics, Economics, and Sciences.

  8. Downloading King County GIS Elevation Contour Data

    • hub.arcgis.com
    • gis-kingcounty.opendata.arcgis.com
    Updated Mar 3, 2017
    Cite
    King County (2017). Downloading King County GIS Elevation Contour Data [Dataset]. https://hub.arcgis.com/documents/05396d5bb3de4c598efcabfb397d76eb
    Dataset updated
    Mar 3, 2017
    Dataset authored and provided by
    King County
    Area covered
    King County
    Description

    Countywide datasets are available as zipped Esri geodatabases. Sets of the 5-foot-interval contours at township-level extents are available as zipped shapefiles in addition to geodatabases. (None of the data are available in GeoJSON or KML format.) Note that the zipped files are exceptionally large. All files are compressed in the open-source 7-Zip format (external link to 7-zip.org). Other utilities that can extract zipped files will work in most cases, but some of these data files might extract with 7-Zip only.
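
    For scripted workflows, the archives can also be extracted with the third-party py7zr package in Python; a sketch, with an illustrative file name:

    import py7zr

    # Extract a downloaded contour archive; py7zr reads the 7-Zip format natively.
    with py7zr.SevenZipFile('contours_township.7z', mode='r') as archive:
        archive.extractall(path='contours')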

  9. Data from: ARS Water Database

    • catalog.data.gov
    • data.cnra.ca.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). ARS Water Database [Dataset]. https://catalog.data.gov/dataset/ars-water-database-82912
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    The ARS Water Data Base is a collection of precipitation and streamflow data from small agricultural watersheds in the United States. This national archive of variable time-series readings for precipitation and runoff contains sufficient detail to reconstruct storm hydrographs and hyetographs. There are currently about 14,000 station years of data stored in the data base. Watersheds used as study areas range from 0.2 hectare (0.5 acres) to 12,400 square kilometers (4,786 square miles). Raingage networks range from one station per watershed to over 200 stations. The period of record for individual watersheds varies from 1 to 50 years. Some watersheds have been in continuous operation since the mid-1930s.

    Resources in this dataset:

    Resource Title: FORMAT INFORMATION FOR VARIOUS RECORD TYPES. File Name: format.txt
    Resource Description: Format information identifying fields and their length is included in this file for all files except those ending with the extension .txt. Data are stored by location number in subdirectories of the form LXX, where XX is the location number. In each subdirectory, there are various files using the following naming conventions:
    • Runoff data: WSXXX.zip, where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
    • Rainfall data: RGXXXXXX.zip, where XXXXXX is the rain gage station identification.
    • Maximum-minimum daily air temperature: MMTXXXXX.zip, where XXXXX is the watershed number assigned by the WDC.
    • Ancillary text files: NOTXXXXX.txt, where XXXXX is the watershed number assigned by the WDC. These files contain textual information including latitude-longitude, the name commonly used in literature, acreage, the most commonly associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
    • Topographic maps of the watersheds: MAPXXXXX.zip, where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
    Not all file types may be available for specific watersheds. Data files are still being compiled and translated into a form viable for this archive.

    Resource Title: Data Inventory - watersheds. File Name: inventor.txt
    Resource Description: Watersheds at which records of runoff were being collected by the Agricultural Research Service. Variables: Study Location & Number of Rain Gages; Name; Lat.; Long.; Number; Pub. Code; Record Began; Land Use; Area (Acres); Types of Data

    Resource Title: Information about the ARS Water Database. File Name: README.txt

    Resource Title: INDEX TO INFORMATION ON EXPERIMENTAL AGRICULTURAL WATERSHEDS. File Name: INDEX.TXT
    Resource Description: This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base; they are so indicated in the column titled "ARS Water Data Base". Other watersheds will not have data available here or through the Water Data Center (WDC). This index is particularly important since it relates watershed names to the indexing system used by the WDC. Each location has been assigned a number, and the data for that location are stored in a subdirectory coded as LXX, where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed are stored in a compressed file named WSXXXXX.zip, where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information is stored in compressed files named RGXXXXXX.zip, where XXXXXX is a 6-character identification of the rain gage station. The index also provides information such as latitude-longitude for each watershed, acreage, and the period of record for each acreage. Multiple entries for a particular watershed indicate either that the acreage designated for the watershed changed or that there was a break in operations of the watershed.

    Resource Title: ARS Water Database files. File Name: ars_water.zip
    Resource Description: Before downloading huge amounts of data from the ARS Water Data Base, first review the text files included in this directory: index.txt and format.txt (described above), plus the station table station.txt, a report indicating the period of record for each recording station represented in the ARS Water Data Base. The data for a particular station are stored in a single compressed file.
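
    Given these naming conventions, the expected paths for one watershed can be assembled mechanically; a sketch in Python (the location and watershed numbers are illustrative):

    from pathlib import Path

    def watershed_files(location, watershed, root='.'):
        # Build the expected file paths for one WDC location/watershed pair.
        loc_dir = Path(root) / f'L{location:02d}'
        return {
            'runoff': loc_dir / f'WS{watershed}.zip',
            'temperature': loc_dir / f'MMT{watershed}.zip',
            'notes': loc_dir / f'NOT{watershed}.txt',
            'map': loc_dir / f'MAP{watershed}.zip',
        }

    paths = watershed_files(26, '26010') # not all file types exist for every watershed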

  10. IUST-PDFCorpus

    • zenodo.org
    • live.european-language-grid.eu
    zip
    Updated Apr 24, 2025
    Cite
    Morteza Zakeri-Nasrabadi (2025). IUST-PDFCorpus [Dataset]. http://doi.org/10.5281/zenodo.3484013
    Available download formats: zip
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Morteza Zakeri-Nasrabadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About

    IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format, while streams have a binary format; together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats.

    PDF data objects are organized into three categories. The first category contains all objects in the corpus; each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing; this dataset is split into train, test, and validation sets, which is useful for machine learning tasks. The third category is the same as the second but smaller, for use during the development stage of different algorithms.

    IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs used as initial seeds in AFL, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper "Format-aware learn&fuzz: deep test data generation for efficient fuzzing", where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.

    Citing IUST-PDFCorpus

    If IUST-PDFCorpus is used in your work in any form please cite the relevant paper: https://arxiv.org/abs/1812.09961v2

  11. Data from: Lidar - LMCT - WTX WindTracer, Gordon Ridge - Raw Data

    • s.cnmilf.com
    • data.openei.org
    Updated Apr 26, 2022
    Cite
    Wind Energy Technologies Office (WETO) (2022). Lidar - LMCT - WTX WindTracer, Gordon Ridge - Raw Data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/lidar-esrl-windcube-200s-wasco-airport-processed-data
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Wind Energy Technologies Office (WETO)
    Description

    Overview

    Long-range scanning Doppler lidar located on Gordon Ridge. The WindTracer provides high-resolution, long-range lidar data for use in the WFIP2 program.

    Data Details

    The system is configured to take data in three different modes. All three modes take 15 minutes to complete and are started at 00, 15, 30, and 45 minutes after the hour. The first nine minutes of the period are spent performing two high-resolution, long-range Plan Position Indicator (PPI) scans at 0.0 and -1.0 degree elevation angles (tilts). These data have file names with HiResPPI noted in the "optional fields" of the file name; for example: lidar.z09.00.20150801.150000.HiResPPI.prd. The next six minutes are spent performing higher-altitude PPI scans and Range Height Indicator (RHI) scans. The PPI scans are completed at 6.0- and 30.0-degree elevations, and the RHI scans are completed from below the horizon (down into valleys, as able) up to 40 degrees elevation at 010-, 100-, 190-, and 280-degree azimuths. These files are annotated with PPI-RHI in the optional fields of the file name; for example: lidar.z09.00.20150801.150900.PPI-RHI.prd. The last minute is spent measuring a high-altitude vertical wind profile. Generally, this dataset will include data from near ground level up to the top of the planetary boundary layer (PBL), and higher-altitude data when high-level cirrus or other clouds are present. The Velocity Azimuth Display (VAD) is measured using six lines of sight at an elevation angle of 75 degrees, at azimuth angles of 000, 060, 120, 180, 240, and 300 degrees from True North. The files are annotated with VAD in the optional fields of the file name; for example: lidar.z09.00.20150801.151400.VAD.prd.

    LMCT does have a data format document that can be provided to users who need programming access to the data. This document is proprietary information but can be supplied to anyone after signing a non-disclosure agreement (NDA). To initiate the NDA process, please contact Keith Barr at keith.barr@lmco.com. The data are not proprietary, only the manual describing the data format.

    Data Quality

    Lockheed Martin Coherent Technologies (LMCT) has implemented and refined data quality analysis over the last 14 years, and this installation uses standard data-quality processing procedures. Generally, filtered data products can be accepted as fully data qualified. Secondary processing, such as wind vector analysis, should be used with some caution, as the data-quality filters are still "young" and incorrect values can be encountered.

    Uncertainty

    Uncertainty in the radial wind measurements (the system's base measurement) varies slightly with range. For most measurements, accuracy of the filtered radial wind measurements has been shown to be within 0.5 m/s, with accuracy better than 0.25 m/s not uncommon for ranges less than 10 km.

    Constraints

    Doppler lidar is dependent on aerosol loading in the atmosphere, and the signal can be significantly attenuated in precipitation and fog. These weather situations can reduce range performance significantly, and, in heavy rain or thick fog, range performance can be reduced to zero. Long-range performance depends on adequate aerosol loading to provide enough backscattered laser radiation so that a measurement can be made.
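
    Since the scan mode is encoded in the file name, files can be grouped by mode without the proprietary format document; a small parsing sketch based on the naming pattern quoted above:

    from datetime import datetime

    def parse_windtracer_name(name):
        # Split e.g. 'lidar.z09.00.20150801.150000.HiResPPI.prd' into its fields.
        parts = name.split('.')
        return {
            'timestamp': datetime.strptime(parts[3] + parts[4], '%Y%m%d%H%M%S'),
            'scan_mode': parts[5], # HiResPPI, PPI-RHI, or VAD
        }

    info = parse_windtracer_name('lidar.z09.00.20150801.151400.VAD.prd')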

  12. Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical...

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2025
    Cite
    UCL Institute Of Education University College London (2025). Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Prescribing Information System, 2009-2015: Secure Access [Dataset]. http://doi.org/10.5255/ukda-sn-8710-1
    Dataset updated
    2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    UCL Institute Of Education University College London
    Area covered
    Scotland
    Description

    Background:
    The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:

    • to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
    • to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
    • to collect information on previously neglected topics, such as fathers' involvement in children's care and development
    • to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
    • to emphasise intergenerational links including those back to the parents' own childhood
    • to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
    Additional objectives subsequently included for MCS were:
    • to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
    • to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England

    Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.

    The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.

    The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

    End User Licence versions of MCS studies:
    The End User Licence (EUL) versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

    Sub-sample studies:
    Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

    Release of Sweeps 1 to 4 to Long Format (Summer 2020)
    To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.
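
    For illustration only, this wide-to-long reshape corresponds to a standard pandas operation; the sketch below uses hypothetical variable names, not actual MCS variables:

    import pandas as pd

    # Hypothetical wide layout: one row per family, one column per respondent.
    wide = pd.DataFrame({'family_id': [1, 2],
                         'age_parent1': [34, 29],
                         'age_parent2': [36, 31]})

    # Long layout: one row per respondent per family, as in sweeps 5 and 6.
    long = pd.wide_to_long(wide, stubnames='age', i='family_id',
                           j='respondent', sep='_', suffix=r'\w+').reset_index()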

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

    Secure Access datasets:
    Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard End User Licence or Special Licence (see 'Access data' tab above).

    Secure Access versions of the MCS include:
    • detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
    • detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
    • linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
    • linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
    • linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
    • linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
    • linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714;
    • Banded Distances to English Grammar Schools for MCS5 held under SN 8394
    • linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
    • linked Health Administrative Datasets (SAIL) for Wales held under SN 9310
    • linked Hospital of Birth data held under SN 5724.
    The linked education administrative datasets held under SNs 8481,7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application.

    Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).

    The Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Prescribing Information System, 2009-2015: Secure Access includes data files from the NHS Digital Hospital Episode Statistics database for those cohort members who provided consent to health data linkage in the Age 50 sweep, and had ever lived in Scotland. The Scottish Medical Records database contains information about all hospital admissions in Scotland. This study concerns the Prescribing Information System.

    Other datasets are available from the Scottish Medical Records database, these include:

    • Child Health Reviews (CHR) held under SN 8709
    • Scottish Immunisation and Recall System (SIRS) held under SN 8711
    • Scottish Birth Records (SMR11) held under SN 8712
    • Inpatient and Day Care Attendance (SMR01) held under SN 8713
    • Outpatient Attendance (SMR00) held under SN 8714

    Users should note that linkage to

  13. Movies & TV Shows Metadata Dataset (190K+ Records, Horror-Heavy Collection)

    • crawlfeeds.com
    csv, zip
    Updated Jun 22, 2025
    Cite
    Crawl Feeds (2025). Movies & TV Shows Metadata Dataset (190K+ Records, Horror-Heavy Collection) [Dataset]. https://crawlfeeds.com/datasets/movies-tv-shows-metadata-dataset-190k-records-horror-heavy-collection
    Available download formats: zip, csv
    Dataset updated
    Jun 22, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    This comprehensive dataset features detailed metadata for over 190,000 movies and TV shows, with a strong concentration in the Horror genre. It is ideal for entertainment research, machine learning models, genre-specific trend analysis, and content recommendation systems.

    Each record contains rich information, making it perfect for streaming platforms, film industry analysts, or academic media researchers.

    Primary Genre Focus: Horror

    Use Cases:

    • Build movie recommendation systems or genre classifiers

    • Train NLP models on movie descriptions

    • Analyze Horror content trends over time

    • Explore box office vs. rating correlations

    • Enrich entertainment datasets with directorial and cast metadata

  14. Electronic Annual Report (eAR) Data for California Water Systems: Processing...

    • search.dataone.org
    • hydroshare.org
    Updated Dec 30, 2023
    Cite
    Erik Porse (2023). Electronic Annual Report (eAR) Data for California Water Systems: Processing Scripts and Formatted Data Files [Dataset]. https://search.dataone.org/view/sha256%3A6762d005d81c61981d10576c5b416a30161e2da33ed427d3f02b210164153355
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Erik Porse
    Time period covered
    Jan 1, 2020 - Dec 31, 2022
    Description

    In California, water systems submit annual operational data such as demographics, water production, water demand, and retail rates to the State Water Resources Control Board. The State Water Resources Control Board publishes the data in a flat-file text format (https://www.waterboards.ca.gov/drinking_water/certlic/drinkingwater/ear.html). From 2013-2019, distinct data files were published for small and large systems. Since 2020, data are combined in a single file.

    This Hydroshare repository publishes user-friendly versions of the 2020-2022 eAR files, which were created to improve accessibility. Flat files of raw data were formatted to have all questions associated with a water system (PWSID) on one line. This allows for data to be viewed and analyzed in typical worksheet software programs.

    This repository contains 1) Python script templates for parsing the 2020, 2021, and 2022 flat data files, and 2) the formatted eAR data files, saved as Excel worksheets. There are separate Python scripts for parsing the 2020 data and the 2021/2022 data.
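
    As an illustration of that reformatting step, pivoting a question-per-row flat file to one line per PWSID is a standard pandas operation (the column names here are assumed; the published scripts should be used for the real files):

    import pandas as pd

    # Hypothetical flat file: one row per (water system, question) pair.
    flat = pd.read_csv('ear_flat.txt', sep='\t')

    # One row per water system, one column per question.
    wide = flat.pivot_table(index='PWSID', columns='QuestionName',
                            values='QuestionResults', aggfunc='first')
    wide.to_excel('ear_formatted.xlsx')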

    Use of the script and files is permitted with attribution. Users are solely responsible for any issues that arise in using or applying data. If any errors are spotted, please contact the author.

  15. Data and Analysis Files Repository: Repurposing Large-Format Microarrays for...

    • zenodo.org
    bin, zip
    Updated Oct 29, 2024
    Cite
    Denis Cipurko (2024). Data and Analysis Files Repository: Repurposing Large-Format Microarrays for Scalable Spatial Transcriptomics [Dataset]. http://doi.org/10.5281/zenodo.10963424
    Available download formats: bin, zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Denis Cipurko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 2024
    Description

    Data and Analysis Files from "Repurposing Large-Format Microarrays for Scalable Spatial Transcriptomics"

    ArraySeq_Method.zip contains the following folders and contents:

    • STARSolo: All code and count matrix output from fastq spatial barcode demultiplexing.
    • Images: All resolution-downsampled H&E image scans from analyzed tissues.
    • Space_Ranger: All 10x Space Ranger output from Visium datasets generated in the paper.
    • Analysis: All scripts for analyzing and plotting Array-seq and Visium datasets generated in this paper. Also contains output h5ad files.

    ArraySeq_Barcode_generation_n12.rmd: The script used to generate the Array-seq probes with 12-mer spatial barcodes.

  16. Non Relational Databases Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 16, 2024
    Cite
    Dataintelo (2024). Non Relational Databases Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/non-relational-databases-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Non Relational Databases Market Outlook



    The global market size for non-relational databases is expected to grow from USD 10.5 billion in 2023 to USD 35.2 billion by 2032, registering a Compound Annual Growth Rate (CAGR) of 14.6% over the forecast period. This substantial growth is primarily driven by increasing demand for scalable, flexible database solutions capable of handling diverse data types and large volumes of data generated across various industries.



    One of the significant growth factors for the non-relational databases market is the exponential increase in data generated globally. With the proliferation of Internet of Things (IoT) devices, social media platforms, and digital transactions, the volume of semi-structured and unstructured data is growing at an unprecedented rate. Traditional relational databases often fall short in efficiently managing such data types, making non-relational databases a preferred choice. For example, document-oriented databases like MongoDB allow for the storage of JSON-like documents, offering flexibility in data modeling and retrieval.



    Another key driver is the increasing adoption of non-relational databases among enterprises seeking agile and scalable database solutions. The need for high-performance applications that can scale horizontally and handle large volumes of transactions is pushing businesses to shift from traditional relational databases to non-relational databases. This is particularly evident in sectors like e-commerce, where the ability to manage customer data, product catalogs, and transaction histories in real-time is crucial. Additionally, companies in the BFSI (Banking, Financial Services, and Insurance) sector are leveraging non-relational databases for fraud detection, risk management, and customer relationship management.



    The advent of cloud computing and the growing trend of digital transformation are also significant contributors to the market growth. Cloud-based non-relational databases offer numerous advantages, including reduced infrastructure costs, scalability, and ease of access. As more organizations migrate their operations to the cloud, the demand for cloud-based non-relational databases is set to rise. Moreover, the availability of Database-as-a-Service (DBaaS) offerings from major cloud providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) is simplifying the deployment and management of these databases, further driving their adoption.



    Regionally, North America holds the largest market share, driven by the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The rapid digitalization, growing adoption of cloud services, and increasing investments in IT infrastructure in countries like China and India are propelling the demand for non-relational databases in the region. Additionally, the expanding e-commerce sector and the proliferation of smart devices are further boosting market growth in Asia Pacific.



    Type Analysis



    The non-relational databases market is segmented into several types, including Document-Oriented Databases, Key-Value Stores, Column-Family Stores, Graph Databases, and Others. Each type offers unique functionalities and caters to specific use cases, making them suitable for different industry requirements. Document-Oriented Databases, such as MongoDB and CouchDB, store data in document format (e.g., JSON or BSON), allowing for flexible schema designs and efficient data retrieval. These databases are widely used in content management systems, e-commerce platforms, and real-time analytics applications due to their ability to handle semi-structured data.
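
    As a concrete illustration of the document model (a sketch using MongoDB's Python driver, pymongo; the connection string and documents are placeholders):

    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    products = client['shop']['products']

    # Documents in one collection need not share a schema.
    products.insert_one({'name': 'lamp', 'price': 30, 'tags': ['home', 'lighting']})
    products.insert_one({'name': 'ebook', 'price': 10, 'formats': ['epub', 'pdf']})

    cheap = list(products.find({'price': {'$lt': 20}})) # flexible querying over JSON-like documents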



    Key-Value Stores, such as Redis and Amazon DynamoDB, store data as key-value pairs, providing extremely fast read and write operations. These databases are ideal for caching, session management, and real-time applications where speed is critical. They offer horizontal scalability and are highly efficient in managing large volumes of data with simple query requirements. The simplicity of the key-value data model and its performance benefits make it a popular choice for high-throughput applications.



    Column-Family Stores, such as Apache Cassandra and HBase, store data in columns rather than rows, allowing for efficient storage and retrieval of large datasets. These databases are designed to handle massive amounts of data across distributed systems, making them suitable for use cases involving big data analytics and time-series data.

  17. Data from: Observed and naturalized discharge data for large Siberian rivers...

    • data.ucar.edu
    • arcticdata.io
    ascii
    Updated Feb 7, 2024
    Cite
    Alexander I. Shiklomanov; Daqing Yang (2024). Observed and naturalized discharge data for large Siberian rivers [Dataset]. https://data.ucar.edu/dataset/observed-and-naturalized-discharge-data-for-large-siberian-rivers
    Available download formats: ascii
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Alexander I. Shiklomanov; Daqing Yang
    Time period covered
    Jun 28, 1902 - Dec 31, 2009
    Description

    The data set includes two types of discharge data: 1) observed daily discharge values compiled at the State Hydrological Institute, Russia, from official sources, and 2) modeled "naturalized" daily discharge. "Naturalized" discharge means discharge values with human impact removed. The data can be used in hydro-climatological analysis to understand interactions between climate and hydrology. A specially developed Hydrograph Transformation Model (HTM) was used to eliminate the effects of reservoirs and other human impacts from the discharge records. These data are formatted as text documents.

  18. Z

    Data for "RegulaTome: a corpus of typed, directed, and signed relations...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 23, 2024
    Cite
    Nastou, Katerina (2024). Data for "RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10808329
    Explore at:
    Dataset updated
    Apr 23, 2024
    Dataset authored and provided by
    Nastou, Katerina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The directory "splits" contains the corpus divided into the train/dev/test splits used for training the relation extraction system.
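    For readers unfamiliar with the BRAT standoff format, here is a minimal Python sketch that reads one .ann file and collects entities (T lines) and binary relations (R lines); the file path is hypothetical, and discontinuous entity spans are skipped:

    # Minimal BRAT standoff reader; a sketch, not the project's own tooling.
    from pathlib import Path

    def read_brat_ann(ann_path):
        entities, relations = {}, []
        for line in Path(ann_path).read_text(encoding="utf-8").splitlines():
            fields = line.split("\t")
            if line.startswith("T"):
                # Entity line: "T1<TAB>Type start end<TAB>covered text".
                tid, type_span, text = fields
                if ";" in type_span:
                    continue  # discontinuous span; not handled in this sketch
                etype, start, end = type_span.split(" ")
                entities[tid] = (etype, int(start), int(end), text)
            elif line.startswith("R"):
                # Relation line: "R1<TAB>Type Arg1:T3 Arg2:T7".
                rtype, arg1, arg2 = fields[1].split(" ")
                relations.append((rtype, arg1.split(":")[1], arg2.split(":")[1]))
        return entities, relations

    ents, rels = read_brat_ann("splits/train/example.ann")  # hypothetical path
    print(len(ents), "entities,", len(rels), "relations")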

    RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/

    The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:

    gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` | cat dictionary/excluded_documents.txt - | tagger/tagcorpus --threads=16 --autodetect --types=dictionary/curated_types.tsv --entities=dictionary/all_entities.tsv --names=dictionary/all_names_textmining.tsv --groups=dictionary/all_groups.tsv --stopwords=dictionary/all_global.tsv --local-stopwords=dictionary/all_local.tsv --type-pairs=dictionary/all_type_pairs.tsv --out-matches=all_matches.tsv

    Input documents for large-scale execution: the run covers all of PubMed (as of March 2024) and all PMC Open Access articles (as of November 2023) in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).

    Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz

    Tagger output: we filter the results of the tagger run down to gene/protein hits and to documents with more than one hit (since we are doing relation extraction) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz.

    Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.

    Relation extraction models: the Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz

    The pre-trained RoBERTa model on PubMed and PMC and MIMIC-III with a BPE Vocab learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system is available here.

    Relation extraction system output: the tab-delimited outputs of the relation extraction system are found at large_scale_relation_extraction_results.tar.gz !!!ATTENTION this file is approximately 1TB in size, so make sure you have enough space to download it on your machine!!!

    The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
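    As a usage illustration, here is a hedged pandas sketch that picks the top-scoring class per candidate pair from one output file (the file name is hypothetical, and column positions rather than exact header strings are used, since only the first three columns are IDs):

    # Sketch: reduce the per-class scores to the single best relation label.
    import pandas as pd

    df = pd.read_csv("re_output_part.tsv", sep="\t")  # hypothetical file name

    id_cols = list(df.columns[:3])     # PMID and the two entity BRAT IDs
    score_cols = list(df.columns[3:])  # one score column per relation class

    df["best_class"] = df[score_cols].idxmax(axis=1)
    df["best_score"] = df[score_cols].max(axis=1)
    print(df[id_cols + ["best_class", "best_score"]].head())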

  19. Z

    Three-component modelling of C-rich AGB-star winds V. – dataset

    • data.niaid.nih.gov
    Updated Oct 19, 2020
    Cite
    Mattsson, Lars (2020). Three-component modelling of C-rich AGB-star winds V. – dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3999342
    Explore at:
    Dataset updated
    Oct 19, 2020
    Dataset provided by
    Sandin, Christer
    Mattsson, Lars
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided data include all parameter files, binary output files, and log files that are the basis for the publication in MNRAS.

    The file 'file_listing.txt' contains a complete list of files and directories in all gzipped tar files. Each individual gzipped tar file is formatted as follows:

    Mm.m_Ll.ll_Ttttt_CtOc.cc.tar.gz

    where

    m.m :: The assumed mass of the model, in solar masses.
    l.ll :: The assumed luminosity, in log10(solar luminosities).
    tttt :: The effective temperature of the star, in Kelvin.
    c.cc :: The carbon-to-oxygen excess, in log10(n_C/n_H - n_O/n_H) + 12.
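    A small Python sketch for decoding these archive names; the regex follows the pattern above, and the example file name is hypothetical:

    # Sketch: parse model parameters out of an archive name of the form
    # Mm.m_Ll.ll_Ttttt_CtOc.cc.tar.gz.
    import re

    NAME_RE = re.compile(
        r"M(?P<mass>[\d.]+)_L(?P<log_lum>[\d.]+)_T(?P<teff>\d+)_CtO(?P<cto>[\d.]+)\.tar\.gz$")

    def parse_model_name(name):
        m = NAME_RE.match(name)
        if m is None:
            raise ValueError("not a model archive name: " + name)
        return {
            "mass_msun": float(m["mass"]),          # m.m: mass in solar masses
            "log_luminosity": float(m["log_lum"]),  # l.ll: log10(solar luminosities)
            "teff_K": int(m["teff"]),               # tttt: effective temperature in K
            "cto_excess": float(m["cto"]),          # c.cc: log10(n_C/n_H - n_O/n_H) + 12
        }

    print(parse_model_name("M1.0_L3.85_T2800_CtO8.50.tar.gz"))  # hypothetical name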

    The contents vary according to the model, but here is the general directory structure:

    nodr/ :: non-drift / PC models
    drift/ :: drift models

    nodr/init, drift/init :: Initial model files created using John Connor.

    File suffixes are the following:

    .par :: Plain-text parameter file that contains all parameters that are different from the respective default value in the model. Consequently, to see all used parameters it is necessary to look in the log file (see below).

    .bin :: Binary file that contains converged models. Each model is stored in two versions: first the previous time step and then the current time step (both are needed to restart model calculations at that time step).

       The initial model file contains only one model, in which the previous time step data are the same as the current time step data.

       The format of this file is explained below.

       Note! These files can get fairly large and are therefore only available for a smaller number of the models here. Please ask the corresponding author for any missing files should the need arise.

    .log :: Plain-text log file that shows the used model parameters and a number of key properties for each converged model. The encoding of this file is UTF-8.

    .inf :: Plain-text secondary log file that contains the header of the [primary] log file as well as timing information.

    .tpb :: Secondary binary file that contains a number of properties specified at the outer boundary, typically for each consecutive time step.

    .lis :: Plain-text file with the iteration history. Available for some files.

    .liv :: Plain-text file with values specified for a number of properties at each gridpoint. Available for a smaller number of files.

    .inp :: Plain-text file that is used to launch a model; some of these files remain in the archives.

    .eps :: Encapsulated PostScript files created by John Connor when calculating the initial model.

    Model evolution structure - file endings before the suffix:

    _rlx :: Files related to relaxing the T-800 calculations on the initial model created by John Connor.

    _exp :: Files related to expanding the initially compact model to the full radial domain.

    _fix :: Files related to the intermediate stage where calculations are changed from expansion to outflow.

    _out :: Files related to the outflow stage of the calculations; this is what you want to look at to see the wind evolution. Results in the paper are calculated using these data.

    Note! Some outflow stage calculations continue the evolution of the previous set of files. The underlying reason for continued calculations is typically that the calculated time interval is too short. Such files are typically given the extension '_cont.lin_out', '_cont2.lin_out', etc.
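    As an illustration, one way to gather a model's full outflow history, continuation runs included, is to glob on the '_out' ending (a sketch only; the directory and file names are hypothetical and should be checked against 'file_listing.txt'):

    # Sketch: collect the outflow-stage files for one model, including
    # continuation runs such as *_cont.lin_out, *_cont2.lin_out.
    from pathlib import Path

    model_dir = Path("M1.0_L3.85_T2800_CtO8.50/drift")  # hypothetical location
    out_files = sorted(model_dir.glob("*_out.log")) + sorted(model_dir.glob("*_cont*"))
    for path in out_files:
        print(path.name)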

    Load files:

    Two tools are provided here that can load the binary data files using the Interactive Data Language (IDL):

    sc_load_bin (for files with the suffix '.bin'):

       Loads the full content of a T-800 binary file and returns a structure with the data.

    sc_load_tpb (for files with the suffix '.tpb'):

       Loads the full content of a T-800 'tpb' binary file and returns a structure with the data.

       Note! Due to the way models run on clusters, this file is sometimes incomplete; this happens when the model code T-800 is stopped as the cluster-specific walltime is reached. In that case it is necessary to use the binary file instead, where data are typically saved every 20th time step.
    

    Alternative tools for Python and Julia may be written in the future, but were not yet available when this dataset was made public. Please contact the corresponding author for the current status on this issue.

  20. L

    Legal Document Checking and Formatting Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 10, 2025
    Cite
    Archive Market Research (2025). Legal Document Checking and Formatting Software Report [Dataset]. https://www.archivemarketresearch.com/reports/legal-document-checking-and-formatting-software-19886
    Explore at:
    ppt, doc, pdf (available download formats)
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Market Size and Growth: The global legal document checking and formatting software market is projected to reach a value of X million by 2033, expanding at a CAGR of XX% from 2025 to 2033. This growth is attributed to the increasing need for accuracy, efficiency, and consistency in legal document preparation. The growing adoption of cloud-based solutions and the rise of artificial intelligence (AI) and machine learning (ML) technologies further drive market expansion.

    Market Trends and Restraints: Key trends shaping the market include the adoption of cloud-based platforms for enhanced accessibility and collaboration, the use of AI and ML for automated document analysis and formatting, and the growing demand for industry-specific solutions. Market restraints include the cost of implementation, data security concerns, and the need for skilled professionals to manage and interpret the software's output. The market is also segmented by type (on-premise and cloud-based) and application (large enterprises and SMEs), with large enterprises holding a dominant share due to their need for robust document management systems.

