https://www.marketreportanalytics.com/privacy-policy
The global document formatting services market is experiencing robust growth, driven by the increasing demand for professionally formatted documents across various sectors. The market's expansion is fueled by several key factors. Firstly, the proliferation of digital documents and the need for consistent branding and professional presentation across all communication channels are driving demand. Secondly, the rising complexity of document creation, particularly in fields like legal and finance, necessitates specialized formatting expertise. Businesses are increasingly outsourcing this function to focus on core competencies, leading to significant market expansion. The academic sector also contributes substantially, with students and researchers requiring formatting assistance for theses, dissertations, and research papers.

While specific market size figures aren't provided, considering the growth in related sectors like digital publishing and freelance editing, a reasonable estimation for the 2025 market size could be around $2.5 billion, growing at a conservative Compound Annual Growth Rate (CAGR) of 10% over the forecast period (2025-2033). This growth is largely segmented across different application areas, with the business and legal sectors showing particularly strong demand. The service itself is divided across document types, with Word documents, PowerPoint presentations, and Excel spreadsheets representing the largest shares. North America and Europe currently hold the largest market shares, but growth potential is high in the Asia-Pacific region, driven by burgeoning economies and increased digital adoption.

Despite its growth trajectory, the market faces some challenges. Competition amongst numerous providers, ranging from large outsourcing firms to individual freelancers, can lead to price pressure. The need for specialized expertise within specific document formatting standards (e.g., legal citations) requires continuous investment in training and upskilling. Moreover, concerns about data security and confidentiality within client documents are areas that providers must address effectively. The evolving technological landscape, with the potential introduction of more advanced automated formatting tools, also represents a long-term challenge. However, the ongoing demand for high-quality, error-free documentation suggests that human-driven expertise in document formatting will remain highly relevant and in demand for the foreseeable future.
This metadata document describes the data contained in the "rawData" folder of this data package. This data package contains all data collected by the Argos System from 20 satellite transmitters attached to Thick-billed murres on their breeding range in arctic and western Alaska, 1995-1996. Five data files are included in the "rawData" folder of this data package. Two data files (with identical content) contain the raw Argos DIAG (Diagnostic) data, one in the legacy verbose ASCII format and one in a tabular Comma Separated Values (CSV) format. Two other data files (with identical content) contain the raw Argos DS (Dispose) data, one in the legacy verbose ASCII format and one in a tabular CSV format. The fifth file, "deploymentAttributes", contains one record for each transmitter deployment in a CSV formatted table. The deployment attributes file contains information such as when the transmitter was attached to the animal, when tracking of a live animal ended, and a variety of variables describing the animal and transmitter. This table is identical to the "deploymentAttributes" table in the "processedData" folder of this data package.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
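Since all values are given in per-unit on a 100 MW base (see above), converting any of these tables to megawatts is a single multiplication. As a minimal illustration (the variable name hourly_loads_MW is ours, not part of the dataset):

hourly_loads_MW = hourly_loads * 100  # per-unit values times the 100 MW base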
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
The Mobile Source Emissions Regulatory Compliance Data Inventory data asset contains measured summary compliance information on light-duty, heavy-duty, and non-road engine manufacturers by model, as well as fee payment data required by Title II of the 1990 Amendments to the Clean Air Act, to certify engines for sale in the U.S. and collect compliance certification fees. Data submitted by manufacturers falls into 12 industries: Heavy Duty Compression Ignition, Marine Spark Ignition, Heavy Duty Spark Ignition, Marine Compression Ignition, Snowmobile, Motorcycle & ATV, Non-Road Compression Ignition, Non-Road Small Spark Ignition, Light-Duty, Evaporative Components, Non-Road Large Spark Ignition, and Locomotive. Title II also requires the collection of fees from manufacturers submitting for compliance certification. Manufacturers submit data on an annual basis to document engine model changes for certification. Manufacturers also submit compliance information on already certified in-use vehicles randomly selected by the EPA one (1) year and four (4) years into their life to ensure that emissions systems continue to function appropriately over time. The EPA performs targeted confirmatory tests on approximately 15% of vehicles submitted for certification. Confirmatory data on engines are associated with the corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Section 209 of the 1990 Amendments to the Clean Air Act grants the State of California the authority to set its own standards and perform its own compliance certification through the California Air Resources Board (CARB). Currently, manufacturers submit compliance information separately to both the EPA and CARB. Currently, data harmonization occurs between EPA data and CARB data only for Motorcycle & ATV submissions. Submitted data comes in XML format or as documents, with the majority of submissions being sent in XML. Data includes descriptive information on the engine itself, as well as on manufacturer testing methods and results. Submissions may include confidential business information (CBI) such as information on estimated sales, new technologies, catalysts and calibration, or other data elements indicated by the submitter as confidential. CBI data is not publicly available, but it is available within EPA under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Pollution emission data covers a range of Criteria Air Pollutants (CAPs) including carbon monoxide, hydrocarbons, nitrogen oxides, and particulate matter. Datasets are segmented by vehicle/engine model and year, with corresponding emission, test, and certification data. Data assets are primarily stored in EPA's Verify system. Data collected from the Heavy Duty Compression Ignition, Marine Spark Ignition, Heavy Duty Spark Ignition, Marine Compression Ignition, and Snowmobile industries, however, are currently stored in legacy systems that will be migrated to Verify in the future. Coverage began in 1979, with early records being primarily paper documents that did not go through the same level of validation as the digital submissions that began in 2005. Mobile Source Emissions Compliance documents, with metadata, certificate, and summary decision information, are made available to the public through EPA.gov via the OTAQ Document Index System (http://iaspub.epa.gov/otaqpub/).
Derivative of the MIMIC IV Waveform Database formatted to be suitable for machine learning.

Formatting: All records are split into intervals of roughly 60 seconds. The parameter values are averaged over each 60-second interval. The PPG signal data are unprocessed, i.e. as in the original dataset. Intervals with PPG signals containing missing data or long runs of constant data are excluded. PPG signals and signal times are truncated to have the same number of data points for all records. Formatted data are split into 3 different file types, namely *_n.csv containing the averaged parameter values, *_s.npy containing PPG signal data, and t.npy containing the respective signal measurement times. Moreover, formatted data are split into trainXX_*, validation_* and test_* data files, where the training data trainXX_* are split into multiple files for easier handling. This dataset was created using the following code: https://gitlab.com/qumphy/wp1-benchmark-data-conversion

Funding: The creation of this dataset has been supported by the European Partnership on Metrology programme 22HLT01 QUMPHY. This project (22HLT01 QUMPHY) has received funding from the EMPIR programme co-financed by the Participating States and from the European Union’s Horizon 2020 research and innovation programme.
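As a rough illustration of how the three file types described above fit together, the Python sketch below reads one parameter table, one signal array, and the shared time vector; the validation_* file names are assumptions based on the naming pattern described here and may not match the actual files:

import numpy as np
import pandas as pd

params = pd.read_csv('validation_n.csv')  # averaged parameter values, one row per ~60-second interval
signals = np.load('validation_s.npy')     # unprocessed PPG signal data for the same intervals
times = np.load('t.npy')                  # signal measurement times shared by all records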
Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Data Extraction from Complex Documents: The model could be used to segment and extract data from complex documents such as financial statements, invoices or reports. Its ability to identify lines and headers could help in parsing data accurately.
Improvement of Accessibility Features: The model could be deployed in applications for visually impaired people, helping them understand text-based data represented in tables by recognizing and vocally relaying the content of each cell organized by lines and headers.
Automating Data Conversion: The model could be used for automating conversion of printed tables into digital format. It can help in scanning books, research papers or old documents and convert tables in them into easily editable and searchable digital format.
Intelligent Data Analysis Tools: It could be incorporated into a Data Analysis Software to pull out specific table data from a large number of documents, thus making the data analysis process more efficient.
Aid in Educational Settings: The model can be used in educational tools to recognize and interpret table data for online learning systems, making studying more interactive and efficient, especially in subjects where tables are commonly used such as Statistics, Economics, and Sciences.
Countywide datasets are available as zipped Esri geodatabases. Sets of the 5-foot-interval contours at township-level extents are available as zipped shapefiles in addition to geodatabases. (None of the data are available in GeoJSON or KML format.) Note that the zipped files are exceptionally large. All files are compressed in the open-source 7-Zip format (external link to 7-zip.org). Other utilities which can extract zipped files will work in most cases, but some of these data files might extract with 7-Zip only.
The ARS Water Data Base is a collection of precipitation and streamflow data from small agricultural watersheds in the United States. This national archive of variable time-series readings for precipitation and runoff contains sufficient detail to reconstruct storm hydrographs and hyetographs. There are currently about 14,000 station years of data stored in the data base. Watersheds used as study areas range from 0.2 hectare (0.5 acres) to 12,400 square kilometers (4,786 square miles). Raingage networks range from one station per watershed to over 200 stations. The period of record for individual watersheds varies from 1 to 50 years. Some watersheds have been in continuous operation since the mid-1930s.

Resources in this dataset:

Resource Title: FORMAT INFORMATION FOR VARIOUS RECORD TYPES. File Name: format.txt
Resource Description: Format information identifying fields and their length will be included in this file for all files except those ending with the extension .txt
TYPES OF FILES: As indicated in the previous section, data have been stored by location number in the form LXX, where XX is the location number. In each subdirectory, there will be various files using the following naming conventions:
Runoff data: WSXXX.zip, where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
Rainfall data: RGXXXXXX.zip, where XXXXXX is the rain gage station identification.
Maximum-minimum daily air temperature: MMTXXXXX.zip, where XXXXX is the watershed number assigned by the WDC.
Ancillary text files: NOTXXXXX.txt, where XXXXX is the watershed number assigned by the WDC. These files contain textual information including latitude-longitude, the name commonly used in literature, acreage, the most commonly associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
Topographic maps of the watersheds: MAPXXXXX.zip, where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.

Resource Title: Data Inventory - watersheds. File Name: inventor.txt
Resource Description: Watersheds at which records of runoff were being collected by the Agricultural Research Service. Variables: Study Location & Number of Rain Gages1; Name; Lat.; Long; Number; Pub. Code; Record Began; Land Use2; Area (Acres); Types of Data3

Resource Title: Information about the ARS Water Database. File Name: README.txt

Resource Title: INDEX TO INFORMATION ON EXPERIMENTAL AGRICULTURAL WATERSHEDS. File Name: INDEX.TXT
Resource Description: This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base; they are so indicated in the column titled ARS Water Data Base. Other watersheds will not have data available here or through the Water Data Center. This index is particularly important since it relates watershed names with the indexing system used by the Water Data Center. Each location has been assigned a number, and the data for that location will be stored in a sub-directory coded as LXX, where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed will be stored in a compressed file named WSXXXXX.zip, where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information will be stored in compressed files named RGXXXXXX.zip, where XXXXXX is a 6-character identification of the rain gage station. The Index also provides information such as latitude-longitude for each of the watersheds, acreage, and the period-of-record for each acreage. Multiple entries for a particular watershed indicate either that the acreage designated for the watershed changed or that there was a break in operations of the watershed.

Resource Title: ARS Water Database files. File Name: ars_water.zip
Resource Description: USING THIS SYSTEM. Before downloading huge amounts of data from the ARS Water Data Base, you should first review the text files included in this directory. They include:
INDEX OF ARS EXPERIMENTAL WATERSHEDS (index.txt): Identification information on all watersheds operated by the ARS, as described above.
STATION TABLE FOR THE ARS WATER DATA BASE (station.txt): This report indicates the period of record for each recording station represented in the ARS Water Data Base. The data for a particular station will be stored in a single compressed file.
FORMAT INFORMATION FOR VARIOUS RECORD TYPES (format.txt): Format information identifying fields and their length, as described above, together with the file naming conventions (runoff data WSXXX.zip, rainfall data RGXXXXXX.zip, maximum-minimum daily air temperature MMTXXXXX.zip, ancillary text files NOTXXXXX.txt, and topographic maps MAPXXXXX.zip). NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.
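To make the naming conventions above concrete, the short Python sketch below assembles the expected file names for a given location and watershed; the helper function and its arguments are hypothetical and purely illustrative, not part of the archive:

def ars_paths(location, watershed, raingage=None):
    # Data are stored by location number in a sub-directory of the form LXX
    subdir = f"L{int(location):02d}"
    paths = {
        "runoff": f"{subdir}/WS{watershed}.zip",        # runoff data, WDC watershed number
        "temperature": f"{subdir}/MMT{watershed}.zip",  # maximum-minimum daily air temperature
        "notes": f"{subdir}/NOT{watershed}.txt",        # ancillary text file
        "map": f"{subdir}/MAP{watershed}.zip",          # topographic map (binary TIF)
    }
    if raingage is not None:
        paths["rainfall"] = f"{subdir}/RG{raingage}.zip"  # 6-character rain gage station ID
    return paths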
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the qualification of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format while streams have a binary format, and together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats. PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. The dataset is split into train, test, and validation sets, which is useful for machine learning tasks. The third category is the same as the second category but smaller in size, for use in the development stage of different algorithms. IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs used as initial seeds in AFL, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
Citing IUST-PDFCorpus
If IUST-PDFCorpus is used in your work in any form please cite the relevant paper: https://arxiv.org/abs/1812.09961v2
Overview: Long-range scanning Doppler lidar located on Gordon Ridge. The WindTracer provides high-resolution, long-range lidar data for use in the WFIP2 program.

Data Details: The system is configured to take data in three different modes. All three modes take 15 minutes to complete and are started at 00, 15, 30, and 45 minutes after the hour. The first nine minutes of the period are spent performing two high-resolution, long-range Plan Position Indicator (PPI) scans at 0.0 and -1.0 degree elevation angles (tilts). These data have file names annotated with HiResPPI in the "optional fields" of the file name; for example: lidar.z09.00.20150801.150000.HiResPPI.prd. The next six minutes are spent performing higher altitude PPI scans and Range Height Indicator (RHI) scans. The PPI scans are completed at 6.0- and 30.0-degree elevations, and the RHI scans are completed from below the horizon (down into valleys, as able) up to 40 degrees elevation at 010-, 100-, 190-, and 280-degree azimuths. These files are annotated with PPI-RHI in the optional fields of the file name; for example: lidar.z09.00.20150801.150900.PPI-RHI.prd. The last minute is spent measuring a high-altitude vertical wind profile. Generally, this dataset will include data from near ground level up to the top of the planetary boundary layer (PBL), and higher altitude data when high-level cirrus or other clouds are present. The Velocity Azimuth Display (VAD) is measured using six lines of sight at an elevation angle of 75 degrees at azimuth angles of 000, 060, 120, 180, 240, and 300 degrees from True North. The files are annotated with VAD in the optional fields of the file name; for example: lidar.z09.00.20150801.151400.VAD.prd. LMCT does have a data format document that can be provided to users who need programming access to the data. This document is proprietary information but can be supplied to anyone after signing a non-disclosure agreement (NDA). To initiate the NDA process, please contact Keith Barr at keith.barr@lmco.com. The data are not proprietary, only the manual describing the data format.

Data Quality: Lockheed Martin Coherent Technologies (LMCT) has implemented and refined data quality analysis over the last 14 years, and this installation uses standard data-quality processing procedures. Generally, filtered data products can be accepted as fully data qualified. Secondary processing, such as wind vector analysis, should be used with some caution as the data-quality filters are still "young" and incorrect values can be encountered.

Uncertainty: Uncertainty in the radial wind measurements (the system's base measurement) varies slightly with range. For most measurements, the accuracy of the filtered radial wind measurements has been shown to be within 0.5 m/s, with accuracy better than 0.25 m/s not uncommon for ranges less than 10 km.

Constraints: Doppler lidar is dependent on aerosol loading in the atmosphere, and the signal can be significantly attenuated in precipitation and fog. These weather situations can reduce range performance significantly, and, in heavy rain or thick fog, range performance can be reduced to zero. Long-range performance depends on adequate aerosol loading to provide enough backscattered laser radiation so that a measurement can be made.
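The file names above follow a regular dot-separated pattern, so scan type and timestamp can be recovered directly from a name. A minimal Python sketch under that assumption (the function name is ours, for illustration only):

from datetime import datetime

def parse_lidar_filename(name):
    # e.g. 'lidar.z09.00.20150801.150000.HiResPPI.prd'
    parts = name.split('.')
    timestamp = datetime.strptime(parts[3] + parts[4], '%Y%m%d%H%M%S')
    scan_mode = parts[5]  # HiResPPI, PPI-RHI, or VAD
    return timestamp, scan_mode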
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

The Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Prescribing Information System, 2009-2015: Secure Access includes data files from the NHS Digital Hospital Episode Statistics database for those cohort members who provided consent to health data linkage in the Age 50 sweep, and had ever lived in Scotland. The Scottish Medical Records database contains information about all hospital admissions in Scotland. This study concerns the Prescribing Information System.
Other datasets are available from the Scottish Medical Records database; these include:
Users should note that linkage to
https://crawlfeeds.com/privacy_policy
This comprehensive dataset features detailed metadata for over 190,000 movies and TV shows, with a strong concentration in the Horror genre. It is ideal for entertainment research, machine learning models, genre-specific trend analysis, and content recommendation systems.
Each record contains rich information, making it perfect for streaming platforms, film industry analysts, or academic media researchers.
Primary Genre Focus: Horror
Build movie recommendation systems or genre classifiers
Train NLP models on movie descriptions
Analyze Horror content trends over time
Explore box office vs. rating correlations
Enrich entertainment datasets with directorial and cast metadata
In California, water systems submit annual operational data such as demographics, water production, water demand, and retail rates to the State Water Resources Control Board. The State Water Resources Control Board publishes the data in a flat file text format (https://www.waterboards.ca.gov/drinking_water/certlic/drinkingwater/ear.html). From 2013-2019, distinct data were published for small and large systems. Since 2020, data are combined in a single file.
This Hydroshare repository publishes user-friendly versions of the 2020-2022 eAR files, which were created to improve accessibility. Flat files of raw data were formatted to have all questions associated with a water system (PWSID) on one line. This allows for data to be viewed and analyzed in typical worksheet software programs.
This repository contains 1) Python script templates for parsing the 2020, 2021, and 2022 flat data files, and 2) the formatted eAR data files, saved as Excel worksheets. There are separate Python scripts for parsing the 2020 data and the 2021/2022 data.
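As an illustration of the reshaping described above, a flat long-format file can be pivoted so that each water system (PWSID) occupies one row. The sketch below is only a rough outline: the file name and column names (PWSID, QuestionName, QuestionResults) are placeholders and may not match the actual eAR headers; the published Python scripts in this repository are the authoritative reference.

import pandas as pd

flat = pd.read_csv('ear_2021_flat.txt', sep='\t', dtype=str)  # hypothetical file name and delimiter
wide = flat.pivot_table(index='PWSID', columns='QuestionName',
                        values='QuestionResults', aggfunc='first')
wide.to_excel('ear_2021_formatted.xlsx')  # requires openpyxl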
Use of the script and files is permitted with attribution. Users are solely responsible for any issues that arise in using or applying data. If any errors are spotted, please contact the author.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and Analysis Files from "Repurposing Large-Format Microarrays for Scalable Spatial Transcriptomics"
ArraySeq_Method.zip contains the following folder and contents:
ArraySeq_Barcode_generation_n12.rmd: The script used to generate the Array-seq probes with 12-mer spatial barcodes.
https://dataintelo.com/privacy-and-policy
The global market size for non-relational databases is expected to grow from USD 10.5 billion in 2023 to USD 35.2 billion by 2032, registering a Compound Annual Growth Rate (CAGR) of 14.6% over the forecast period. This substantial growth is primarily driven by increasing demand for scalable, flexible database solutions capable of handling diverse data types and large volumes of data generated across various industries.
One of the significant growth factors for the non-relational databases market is the exponential increase in data generated globally. With the proliferation of Internet of Things (IoT) devices, social media platforms, and digital transactions, the volume of semi-structured and unstructured data is growing at an unprecedented rate. Traditional relational databases often fall short in efficiently managing such data types, making non-relational databases a preferred choice. For example, document-oriented databases like MongoDB allow for the storage of JSON-like documents, offering flexibility in data modeling and retrieval.
Another key driver is the increasing adoption of non-relational databases among enterprises seeking agile and scalable database solutions. The need for high-performance applications that can scale horizontally and handle large volumes of transactions is pushing businesses to shift from traditional relational databases to non-relational databases. This is particularly evident in sectors like e-commerce, where the ability to manage customer data, product catalogs, and transaction histories in real-time is crucial. Additionally, companies in the BFSI (Banking, Financial Services, and Insurance) sector are leveraging non-relational databases for fraud detection, risk management, and customer relationship management.
The advent of cloud computing and the growing trend of digital transformation are also significant contributors to the market growth. Cloud-based non-relational databases offer numerous advantages, including reduced infrastructure costs, scalability, and ease of access. As more organizations migrate their operations to the cloud, the demand for cloud-based non-relational databases is set to rise. Moreover, the availability of Database-as-a-Service (DBaaS) offerings from major cloud providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) is simplifying the deployment and management of these databases, further driving their adoption.
Regionally, North America holds the largest market share, driven by the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The rapid digitalization, growing adoption of cloud services, and increasing investments in IT infrastructure in countries like China and India are propelling the demand for non-relational databases in the region. Additionally, the expanding e-commerce sector and the proliferation of smart devices are further boosting market growth in Asia Pacific.
The non-relational databases market is segmented into several types, including Document-Oriented Databases, Key-Value Stores, Column-Family Stores, Graph Databases, and Others. Each type offers unique functionalities and caters to specific use cases, making them suitable for different industry requirements. Document-Oriented Databases, such as MongoDB and CouchDB, store data in document format (e.g., JSON or BSON), allowing for flexible schema designs and efficient data retrieval. These databases are widely used in content management systems, e-commerce platforms, and real-time analytics applications due to their ability to handle semi-structured data.
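As a brief illustration of the flexible, schema-light document model described above, the following Python sketch stores and retrieves a JSON-like document with MongoDB's official driver; it assumes a locally running MongoDB instance and uses example names only:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
products = client['shop']['products']  # database and collection are created on first use
products.insert_one({'sku': 'A-100',
                     'name': 'Widget',
                     'attributes': {'color': 'red', 'sizes': ['S', 'M']}})
print(products.find_one({'attributes.color': 'red'}))  # query a nested field directly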
Key-Value Stores, such as Redis and Amazon DynamoDB, store data as key-value pairs, providing extremely fast read and write operations. These databases are ideal for caching, session management, and real-time applications where speed is critical. They offer horizontal scalability and are highly efficient in managing large volumes of data with simple query requirements. The simplicity of the key-value data model and its performance benefits make it a popular choice for high-throughput applications.
Column-Family Stores, such as Apache Cassandra and HBase, store data in columns rather than rows, allowing for efficient storage and retrieval of large datasets. These databases are designed to handle massive amounts of data across distributed systems, making them suitable for use cases involving big data analytics and time-series data.
The data set includes two types of discharge data: 1) observed daily discharge values compiled in the State Hydrological Institute, Russia from official sources and 2) modeled "naturalized" daily discharge. The "naturalized" discharge means discharge values with excluded human impact. The data can be used in hydro-climatological analysis to understand interactions between climate and hydrology. A specially developed Hydrograph Transformation Model (HTM) was used to eliminate effects of reservoirs and other human impact from discharge records. These data are formatted as text documents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The directory "splits" contains the corpus split into the train/dev/test sets used for training the relation extraction system.
RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/
The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:
gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv
Input documents for large-scale execution: the large-scale run covers all of PubMed (as of March 2024) and the PMC Open Access articles (as of November 2023) in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).
Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz
Tagger output: we filter the results of the tagger run down to gene/protein hits, and documents with more than 1 hit (since we are doing relation extraction) before feeding it to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
Relation extraction models. The Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz
The pre-trained RoBERTa model on PubMed and PMC and MIMIC-III with a BPE Vocab learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system is available here.
Relation extraction system output: the tab-delimited outputs of the relation extraction system are found at large_scale_relation_extraction_results.tar.gz !!!ATTENTION this file is approximately 1TB in size, so make sure you have enough space to download it on your machine!!!
The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
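Given the column layout described above (three identifier columns followed by per-class scores, with a header row in each file), individual output files could be inspected with pandas roughly as follows; the file name and the 0.5 threshold are arbitrary examples, not part of the released data:

import pandas as pd

scores = pd.read_csv('relations_part.tsv', sep='\t')         # hypothetical file name
score_columns = scores.columns[3:]                           # everything after PMID and the two BRAT IDs
confident = scores[scores[score_columns].max(axis=1) > 0.5]  # keep rows with at least one high class score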
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided data include all parameter files, binary output files, and log files that are the basis for the publication in MNRAS.
The file 'file_listing.txt' contains a complete list of files and directories in all gzipped tar files. Each individual gzipped tar file is formatted as follows:
Mm.m_Ll.ll_Ttttt_CtOc.cc.tar.gz
where
m.m :: the assumed mass of the model, in solar masses
l.ll :: the assumed luminosity, in log10(solar luminosities)
tttt :: the effective temperature of the star, in Kelvin
c.cc :: the carbon-to-oxygen excess, in log10(n_C/n_H - n_O/n_H) + 12
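Since the archive names follow this fixed pattern, the model parameters can be recovered from a file name alone. A small Python sketch under that assumption (the function, the regular expression, and the example name 'M1.0_L3.85_T2800_CtO8.20.tar.gz' are ours, for illustration only):

import re

NAME_RE = re.compile(r'M(?P<mass>[\d.]+)_L(?P<logL>[\d.]+)_T(?P<Teff>\d+)_CtO(?P<C_excess>[\d.]+)\.tar\.gz')

def parse_model_name(filename):
    # Returns the model parameters encoded in the archive name, or None if the name does not match
    match = NAME_RE.fullmatch(filename)
    return {key: float(value) for key, value in match.groupdict().items()} if match else None

print(parse_model_name('M1.0_L3.85_T2800_CtO8.20.tar.gz'))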
The contents vary according to the model, but here is the general directory structure:
nodr/ :: non-drift / PC models
drift/ :: drift models
nodr/init, drift/init :: Initial model files created using John Connor.
File suffixes are the following:
.par :: Plain-text parameter file that contains all parameters that are different from the respective default value in the model. Consequently, to see all used parameters it is necessary to look in the log file (see below).
.bin :: Binary file that contains converged models. Each model is stored in two versions, first the previous time step and then the current time step (both are needed to restart model calculations at that time step). The initial model file only contains one model, where the previous time step data are the same as the current time step data. The format of this file is explained below. Note! These files can get pretty large and are therefore only available for a smaller number of the models here. Please ask the corresponding author for the missing files should the need appear.
.log :: Plain-text log file that shows the used model parameters and a number of key properties for each converged model. The encoding of this file is UTF-8.
.inf :: Plain-text secondary log file that contains the header of the [primary] log file as well as timing information.
.tpb :: Secondary binary file that contains a number of properties specified at the outer boundary, typically for each consecutive time step.
.lis :: Plain-text file with the iteration history. Available for some files.
.liv :: Plain-text file with values specified for a number of properties at each gridpoint. Available for a smaller number of files.
.inp :: Plain-text file that is used to launch a model; some are still there.
.eps :: Encapsulated PostScript files created by John Connor when calculating the initial model.
Model evolution structure - file endings before the suffix:
_rlx :: Files related to relaxing the T-800 calculations on the initial model created by John Connor.
_exp :: Files related to expanding the initially compact model to the full radial domain.
_fix :: Files related to the intermediate stage where calculations are changed from expansion to outflow.
_out :: Files related to the outflow stage of the calculations; this is what you want to look at to see the wind evolution. Results in the paper are calculated using these data.
Note! Some outflow stage calculations continue the evolution of the previous set of files. The underlying reason for continued calculations is typically that the calculated time interval is too short. Such files are typically given the extension '_cont.lin_out', '_cont2.lin_out', etc.
Load files:
Two tools are provided here that can load the binary data files using the Interactive Data Language (IDL):
sc_load_bin (for files with the suffix '.bin'): Loads the full content of a T-800 binary file and returns a structure with the data.

sc_load_tpb (for files with the suffix '.tpb'): Loads the full content of a T-800 'tpb' binary file and returns a structure with the data. Note! Due to the way models run on clusters, this file is sometimes incomplete; this happens when the model code T-800 is stopped as the cluster-specific walltime is reached. If this is the case, it is necessary to use the binary file instead, where data are typically saved every 20th time step.
Alternative tools for use with Python and Julia could be considered, but were not yet available when this dataset was made public. Please contact the corresponding author for the current status on this issue.
https://www.archivemarketresearch.com/privacy-policy
Market Size and Growth: The global legal document checking and formatting software market is projected to reach a value of X million by 2033, expanding at a CAGR of XX% from 2025 to 2033. This surge is attributed to the increasing need for accuracy, efficiency, and consistency in legal document preparation. The increasing adoption of cloud-based solutions and the rise of artificial intelligence (AI) and machine learning (ML) technologies are further driving market growth.

Market Trends and Restraints: Key trends shaping the market include the adoption of cloud-based platforms for enhanced accessibility and collaboration, the use of AI and ML for automated document analysis and formatting, and the growing demand for industry-specific solutions. Market restraints include the cost of implementation, data security concerns, and the need for skilled professionals to manage and interpret the software's output. The market is also segmented by type (on-premise and cloud-based) and application (large enterprises and SMEs), with large enterprises holding a dominant market share due to their need for robust document management systems.