https://www.marketreportanalytics.com/privacy-policy
The global document formatting services market is experiencing robust growth, driven by the increasing demand for professionally formatted documents across various sectors. The market's expansion is fueled by several key factors. Firstly, the proliferation of digital documents and the need for consistent branding and professional presentation across all communication channels are driving demand. Secondly, the rising complexity of document creation, particularly in fields like legal and finance, necessitates specialized formatting expertise. Businesses are increasingly outsourcing this function to focus on core competencies, leading to significant market expansion. The academic sector also contributes substantially, with students and researchers requiring formatting assistance for theses, dissertations, and research papers.

While specific market size figures aren't provided, considering the growth in related sectors like digital publishing and freelance editing, a reasonable estimation for the 2025 market size could be around $2.5 billion, growing at a conservative Compound Annual Growth Rate (CAGR) of 10% over the forecast period (2025-2033). This growth is largely segmented across different application areas, with the business and legal sectors showing particularly strong demand. The service itself is divided across document types, with Word documents, PowerPoint presentations, and Excel spreadsheets representing the largest shares. North America and Europe currently hold the largest market shares, but growth potential is high in the Asia-Pacific region, driven by burgeoning economies and increased digital adoption.

Despite its growth trajectory, the market faces some challenges. Competition amongst numerous providers, ranging from large outsourcing firms to individual freelancers, can lead to price pressure. The need for specialized expertise within specific document formatting standards (e.g., legal citations) requires continuous investment in training and upskilling. Moreover, concerns about data security and confidentiality within client documents are areas that providers must address effectively. The evolving technological landscape, with the potential introduction of more advanced automated formatting tools, also represents a long-term challenge. However, the ongoing demand for high-quality, error-free documentation suggests that human-driven expertise in document formatting will remain highly relevant and in demand for the foreseeable future.
This metadata document describes the data contained in the "rawData" folder of this data package. This data package contains all data collected by the Argos System from 20 satellite transmitters attached to Thick-billed murres on their breeding range in arctic and western Alaska, 1995-1996. Five data files are included in the "rawData" folder of this data package. Two data files (with identical content) contain the raw Argos DIAG (Diagnostic) data, one in the legacy verbose ASCII format and one in a tabular Comma Separated Values (CSV) format. Two other data files (with identical content) contain the raw Argos DS (Dispose) data, one in the legacy verbose ASCII format and one in a tabular CSV format. The fifth file, "deploymentAttributes", contains one record for each transmitter deployment in a CSV formatted table. The deployment attributes file contains information such as when the transmitter was attached to the animal, when tracking of a live animal ended, and a variety of variables describing the animal and transmitter. This table is identical to the "deploymentAttributes" table in the "processedData" folder of this data package.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
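Since all values are given in per-unit on a 100 MW base (see above), converting any of these tables to megawatts is a single multiplication. As a minimal illustration (the variable name hourly_loads_MW is ours, not part of the dataset):

hourly_loads_MW = hourly_loads * 100  # per-unit values times the 100 MW base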
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
The Mobile Source Emissions Regulatory Compliance Data Inventory data asset contains measured summary compliance information on light-duty, heavy-duty, and non-road engine manufacturers by model, as well as fee payment data required by Title II of the 1990 Amendments to the Clean Air Act, to certify engines for sale in the U.S. and collect compliance certification fees. Data submitted by manufacturers falls into 12 industries: Heavy Duty Compression Ignition, Marine Spark Ignition, Heavy Duty Spark Ignition, Marine Compression Ignition, Snowmobile, Motorcycle & ATV, Non-Road Compression Ignition, Non-Road Small Spark Ignition, Light-Duty, Evaporative Components, Non-Road Large Spark Ignition, and Locomotive. Title II also requires the collection of fees from manufacturers submitting for compliance certification. Manufacturers submit data on an annual basis to document engine model changes for certification. Manufacturers also submit compliance information on already certified in-use vehicles randomly selected by the EPA one (1) year and four (4) years into their life to ensure that emissions systems continue to function appropriately over time. The EPA performs targeted confirmatory tests on approximately 15% of vehicles submitted for certification. Confirmatory data on engines are associated with the corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Section 209 of the 1990 Amendments to the Clean Air Act grants the State of California the authority to set its own standards and perform its own compliance certification through the California Air Resources Board (CARB). Currently, manufacturers submit compliance information separately to both the EPA and CARB. Currently, data harmonization occurs between EPA data and CARB data only for Motorcycle & ATV submissions. Submitted data comes in XML format or as documents, with the majority of submissions being sent in XML. Data includes descriptive information on the engine itself, as well as on manufacturer testing methods and results. Submissions may include confidential business information (CBI) such as information on estimated sales, new technologies, catalysts and calibration, or other data elements indicated by the submitter as confidential. CBI data is not publicly available, but it is available within EPA under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Pollution emission data covers a range of Criteria Air Pollutants (CAPs) including carbon monoxide, hydrocarbons, nitrogen oxides, and particulate matter. Datasets are segmented by vehicle/engine model and year, with corresponding emission, test, and certification data. Data assets are primarily stored in EPA's Verify system. Data collected from the Heavy Duty Compression Ignition, Marine Spark Ignition, Heavy Duty Spark Ignition, Marine Compression Ignition, and Snowmobile industries, however, are currently stored in legacy systems that will be migrated to Verify in the future. Coverage began in 1979, with early records being primarily paper documents that did not go through the same level of validation as the digital submissions that began in 2005. Mobile Source Emissions Compliance documents, with metadata, certificate, and summary decision information, are made available to the public through EPA.gov via the OTAQ Document Index System (http://iaspub.epa.gov/otaqpub/).
Derivative of the MIMIC IV Waveform Database formatted to be suitable for machine learning.

Formatting: All records are split into intervals of roughly 60 seconds. The parameter values are averaged over each 60-second interval. The PPG signal data are unprocessed, i.e. as in the original dataset. Intervals with PPG signals containing missing data or long runs of constant data are excluded. PPG signals and signal times are truncated to have the same number of data points for all records. Formatted data are split into 3 different file types, namely *_n.csv containing the averaged parameter values, *_s.npy containing PPG signal data, and t.npy containing the respective signal measurement times. Moreover, formatted data are split into trainXX_*, validation_* and test_* data files, where the training data trainXX_* are split into multiple files for easier handling. This dataset was created using the following code: https://gitlab.com/qumphy/wp1-benchmark-data-conversion

Funding: The creation of this dataset has been supported by the European Partnership on Metrology programme 22HLT01 QUMPHY. This project (22HLT01 QUMPHY) has received funding from the EMPIR programme co-financed by the Participating States and from the European Union’s Horizon 2020 research and innovation programme.
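As a rough illustration of how the three file types described above fit together, the Python sketch below reads one parameter table, one signal array, and the shared time vector; the validation_* file names are assumptions based on the naming pattern described here and may not match the actual files:

import numpy as np
import pandas as pd

params = pd.read_csv('validation_n.csv')  # averaged parameter values, one row per ~60-second interval
signals = np.load('validation_s.npy')     # unprocessed PPG signal data for the same intervals
times = np.load('t.npy')                  # signal measurement times shared by all records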
Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Data Extraction from Complex Documents: The model could be used to segment and extract data from complex documents such as financial statements, invoices or reports. Its ability to identify lines and headers could help in parsing data accurately.
Improvement of Accessibility Features: The model could be deployed in applications for visually impaired people, helping them understand text-based data represented in tables by recognizing and vocally relaying the content of each cell organized by lines and headers.
Automating Data Conversion: The model could be used for automating conversion of printed tables into digital format. It can help in scanning books, research papers or old documents and convert tables in them into easily editable and searchable digital format.
Intelligent Data Analysis Tools: It could be incorporated into a Data Analysis Software to pull out specific table data from a large number of documents, thus making the data analysis process more efficient.
Aid in Educational Settings: The model can be used in educational tools to recognize and interpret table data for online learning systems, making studying more interactive and efficient, especially in subjects where tables are commonly used such as Statistics, Economics, and Sciences.
Countywide datasets are available as zipped Esri geodatabases. Sets of the 5-foot-interval contours at township-level extents are available as zipped shapefiles in addition to geodatabases. (None of the data are available in GeoJSON or KML format.) Note that the zipped files are exceptionally large. All files are compressed in the open-source 7-Zip format (external link to 7-zip.org). Other utilities which can extract zipped files will work in most cases, but some of these data files might extract with 7-Zip only.
The ARS Water Data Base is a collection of precipitation and streamflow data from small agricultural watersheds in the United States. This national archive of variable time-series readings for precipitation and runoff contains sufficient detail to reconstruct storm hydrographs and hyetographs. There are currently about 14,000 station years of data stored in the data base. Watersheds used as study areas range from 0.2 hectare (0.5 acres) to 12,400 square kilometers (4,786 square miles). Raingage networks range from one station per watershed to over 200 stations. The period of record for individual watersheds varies from 1 to 50 years. Some watersheds have been in continuous operation since the mid-1930s.

Resources in this dataset:

Resource Title: FORMAT INFORMATION FOR VARIOUS RECORD TYPES. File Name: format.txt
Resource Description: Format information identifying fields and their length will be included in this file for all files except those ending with the extension .txt
TYPES OF FILES: As indicated in the previous section, data have been stored by location number in the form LXX, where XX is the location number. In each subdirectory, there will be various files using the following naming conventions:
Runoff data: WSXXX.zip, where XXX is the watershed number assigned by the WDC. This number may or may not correspond to a naming convention used in common literature.
Rainfall data: RGXXXXXX.zip, where XXXXXX is the rain gage station identification.
Maximum-minimum daily air temperature: MMTXXXXX.zip, where XXXXX is the watershed number assigned by the WDC.
Ancillary text files: NOTXXXXX.txt, where XXXXX is the watershed number assigned by the WDC. These files contain textual information including latitude-longitude, the name commonly used in literature, acreage, the most commonly associated rain gage(s) (if known by the WDC), a list of all rain gages on or near the watershed, and land use, topography, and soils as known by the WDC.
Topographic maps of the watersheds: MAPXXXXX.zip, where XXXXX is the location/watershed number assigned by the WDC. Map files are binary TIF files.
NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.

Resource Title: Data Inventory - watersheds. File Name: inventor.txt
Resource Description: Watersheds at which records of runoff were being collected by the Agricultural Research Service. Variables: Study Location & Number of Rain Gages1; Name; Lat.; Long; Number; Pub. Code; Record Began; Land Use2; Area (Acres); Types of Data3

Resource Title: Information about the ARS Water Database. File Name: README.txt

Resource Title: INDEX TO INFORMATION ON EXPERIMENTAL AGRICULTURAL WATERSHEDS. File Name: INDEX.TXT
Resource Description: This report includes identification information on all watersheds operated by the ARS. Only some of these are included in the ARS Water Data Base; they are so indicated in the column titled ARS Water Data Base. Other watersheds will not have data available here or through the Water Data Center. This index is particularly important since it relates watershed names with the indexing system used by the Water Data Center. Each location has been assigned a number, and the data for that location will be stored in a sub-directory coded as LXX, where XX is the location number. The index also indicates the watershed number used by the WDC. Data for a particular watershed will be stored in a compressed file named WSXXXXX.zip, where XXXXX is the watershed number assigned by the WDC. Although not included in the index, rain gage information will be stored in compressed files named RGXXXXXX.zip, where XXXXXX is a 6-character identification of the rain gage station. The Index also provides information such as latitude-longitude for each of the watersheds, acreage, and the period-of-record for each acreage. Multiple entries for a particular watershed indicate either that the acreage designated for the watershed changed or that there was a break in operations of the watershed.

Resource Title: ARS Water Database files. File Name: ars_water.zip
Resource Description: USING THIS SYSTEM. Before downloading huge amounts of data from the ARS Water Data Base, you should first review the text files included in this directory. They include:
INDEX OF ARS EXPERIMENTAL WATERSHEDS (index.txt): Identification information on all watersheds operated by the ARS, as described above.
STATION TABLE FOR THE ARS WATER DATA BASE (station.txt): This report indicates the period of record for each recording station represented in the ARS Water Data Base. The data for a particular station will be stored in a single compressed file.
FORMAT INFORMATION FOR VARIOUS RECORD TYPES (format.txt): Format information identifying fields and their length, as described above, together with the file naming conventions (runoff data WSXXX.zip, rainfall data RGXXXXXX.zip, maximum-minimum daily air temperature MMTXXXXX.zip, ancillary text files NOTXXXXX.txt, and topographic maps MAPXXXXX.zip). NOT ALL FILE TYPES MAY BE AVAILABLE FOR SPECIFIC WATERSHEDS. Data files are still being compiled and translated into a form viable for this archive. Please bear with us while we grow.
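To make the naming conventions above concrete, the short Python sketch below assembles the expected file names for a given location and watershed; the helper function and its arguments are hypothetical and purely illustrative, not part of the archive:

def ars_paths(location, watershed, raingage=None):
    # Data are stored by location number in a sub-directory of the form LXX
    subdir = f"L{int(location):02d}"
    paths = {
        "runoff": f"{subdir}/WS{watershed}.zip",        # runoff data, WDC watershed number
        "temperature": f"{subdir}/MMT{watershed}.zip",  # maximum-minimum daily air temperature
        "notes": f"{subdir}/NOT{watershed}.txt",        # ancillary text file
        "map": f"{subdir}/MAP{watershed}.zip",          # topographic map (binary TIF)
    }
    if raingage is not None:
        paths["rainfall"] = f"{subdir}/RG{raingage}.zip"  # 6-character rain gage station ID
    return paths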
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the qualification of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format while streams have a binary format, and together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats. PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. The dataset is split into train, test, and validation sets, which is useful for machine learning tasks. The third category is the same as the second category but smaller in size, for use in the development stage of different algorithms. IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs used as initial seeds in AFL, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
Citing IUST-PDFCorpus
If IUST-PDFCorpus is used in your work in any form please cite the relevant paper: https://arxiv.org/abs/1812.09961v2
Overview: Long-range scanning Doppler lidar located on Gordon Ridge. The WindTracer provides high-resolution, long-range lidar data for use in the WFIP2 program.

Data Details: The system is configured to take data in three different modes. All three modes take 15 minutes to complete and are started at 00, 15, 30, and 45 minutes after the hour. The first nine minutes of the period are spent performing two high-resolution, long-range Plan Position Indicator (PPI) scans at 0.0 and -1.0 degree elevation angles (tilts). These data have file names annotated with HiResPPI in the "optional fields" of the file name; for example: lidar.z09.00.20150801.150000.HiResPPI.prd. The next six minutes are spent performing higher altitude PPI scans and Range Height Indicator (RHI) scans. The PPI scans are completed at 6.0- and 30.0-degree elevations, and the RHI scans are completed from below the horizon (down into valleys, as able) up to 40 degrees elevation at 010-, 100-, 190-, and 280-degree azimuths. These files are annotated with PPI-RHI in the optional fields of the file name; for example: lidar.z09.00.20150801.150900.PPI-RHI.prd. The last minute is spent measuring a high-altitude vertical wind profile. Generally, this dataset will include data from near ground level up to the top of the planetary boundary layer (PBL), and higher altitude data when high-level cirrus or other clouds are present. The Velocity Azimuth Display (VAD) is measured using six lines of sight at an elevation angle of 75 degrees at azimuth angles of 000, 060, 120, 180, 240, and 300 degrees from True North. The files are annotated with VAD in the optional fields of the file name; for example: lidar.z09.00.20150801.151400.VAD.prd. LMCT does have a data format document that can be provided to users who need programming access to the data. This document is proprietary information but can be supplied to anyone after signing a non-disclosure agreement (NDA). To initiate the NDA process, please contact Keith Barr at keith.barr@lmco.com. The data are not proprietary, only the manual describing the data format.

Data Quality: Lockheed Martin Coherent Technologies (LMCT) has implemented and refined data quality analysis over the last 14 years, and this installation uses standard data-quality processing procedures. Generally, filtered data products can be accepted as fully data qualified. Secondary processing, such as wind vector analysis, should be used with some caution as the data-quality filters are still "young" and incorrect values can be encountered.

Uncertainty: Uncertainty in the radial wind measurements (the system's base measurement) varies slightly with range. For most measurements, the accuracy of the filtered radial wind measurements has been shown to be within 0.5 m/s, with accuracy better than 0.25 m/s not uncommon for ranges less than 10 km.

Constraints: Doppler lidar is dependent on aerosol loading in the atmosphere, and the signal can be significantly attenuated in precipitation and fog. These weather situations can reduce range performance significantly, and, in heavy rain or thick fog, range performance can be reduced to zero. Long-range performance depends on adequate aerosol loading to provide enough backscattered laser radiation so that a measurement can be made.
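The file names above follow a regular dot-separated pattern, so scan type and timestamp can be recovered directly from a name. A minimal Python sketch under that assumption (the function name is ours, for illustration only):

from datetime import datetime

def parse_lidar_filename(name):
    # e.g. 'lidar.z09.00.20150801.150000.HiResPPI.prd'
    parts = name.split('.')
    timestamp = datetime.strptime(parts[3] + parts[4], '%Y%m%d%H%M%S')
    scan_mode = parts[5]  # HiResPPI, PPI-RHI, or VAD
    return timestamp, scan_mode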
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

The Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Prescribing Information System, 2009-2015: Secure Access includes data files from the NHS Digital Hospital Episode Statistics database for those cohort members who provided consent to health data linkage in the Age 50 sweep, and had ever lived in Scotland. The Scottish Medical Records database contains information about all hospital admissions in Scotland. This study concerns the Prescribing Information System.
Other datasets are available from the Scottish Medical Records database; these include:
Users should note that linkage to
https://crawlfeeds.com/privacy_policy
This comprehensive dataset features detailed metadata for over 190,000 movies and TV shows, with a strong concentration in the Horror genre. It is ideal for entertainment research, machine learning models, genre-specific trend analysis, and content recommendation systems.
Each record contains rich information, making it perfect for streaming platforms, film industry analysts, or academic media researchers.
Primary Genre Focus: Horror
Build movie recommendation systems or genre classifiers
Train NLP models on movie descriptions
Analyze Horror content trends over time
Explore box office vs. rating correlations
Enrich entertainment datasets with directorial and cast metadata
In California, water systems submit annual operational data such as demographics, water production, water demand, and retail rates to the State Water Resources Control Board. The State Water Resources Control Board publishes the data in a flat file text format (https://www.waterboards.ca.gov/drinking_water/certlic/drinkingwater/ear.html). From 2013-2019, distinct data were published for small and large systems. Since 2020, data are combined in a single file.
This Hydroshare repository publishes user-friendly versions of the 2020-2022 eAR files, which were created to improve accessibility. Flat files of raw data were formatted to have all questions associated with a water system (PWSID) on one line. This allows for data to be viewed and analyzed in typical worksheet software programs.
This repository contains 1) Python script templates for parsing the 2020, 2021, and 2022 flat data files, and 2) the formatted eAR data files, saved as Excel worksheets. There are separate Python scripts for parsing the 2020 data and the 2021/2022 data.
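As an illustration of the reshaping described above, a flat long-format file can be pivoted so that each water system (PWSID) occupies one row. The sketch below is only a rough outline: the file name and column names (PWSID, QuestionName, QuestionResults) are placeholders and may not match the actual eAR headers; the published Python scripts in this repository are the authoritative reference.

import pandas as pd

flat = pd.read_csv('ear_2021_flat.txt', sep='\t', dtype=str)  # hypothetical file name and delimiter
wide = flat.pivot_table(index='PWSID', columns='QuestionName',
                        values='QuestionResults', aggfunc='first')
wide.to_excel('ear_2021_formatted.xlsx')  # requires openpyxl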
Use of the script and files is permitted with attribution. Users are solely responsible for any issues that arise in using or applying data. If any errors are spotted, please contact the author.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and Analysis Files from "Repurposing Large-Format Microarrays for Scalable Spatial Transcriptomics"
ArraySeq_Method.zip contains the following folder and contents:
ArraySeq_Barcode_generation_n12.rmd: The script used to generate the Array-seq probes with 12-mer spatial barcodes.
https://dataintelo.com/privacy-and-policy
The global market size for non-relational databases is expected to grow from USD 10.5 billion in 2023 to USD 35.2 billion by 2032, registering a Compound Annual Growth Rate (CAGR) of 14.6% over the forecast period. This substantial growth is primarily driven by increasing demand for scalable, flexible database solutions capable of handling diverse data types and large volumes of data generated across various industries.
One of the significant growth factors for the non-relational databases market is the exponential increase in data generated globally. With the proliferation of Internet of Things (IoT) devices, social media platforms, and digital transactions, the volume of semi-structured and unstructured data is growing at an unprecedented rate. Traditional relational databases often fall short in efficiently managing such data types, making non-relational databases a preferred choice. For example, document-oriented databases like MongoDB allow for the storage of JSON-like documents, offering flexibility in data modeling and retrieval.
Another key driver is the increasing adoption of non-relational databases among enterprises seeking agile and scalable database solutions. The need for high-performance applications that can scale horizontally and handle large volumes of transactions is pushing businesses to shift from traditional relational databases to non-relational databases. This is particularly evident in sectors like e-commerce, where the ability to manage customer data, product catalogs, and transaction histories in real-time is crucial. Additionally, companies in the BFSI (Banking, Financial Services, and Insurance) sector are leveraging non-relational databases for fraud detection, risk management, and customer relationship management.
The advent of cloud computing and the growing trend of digital transformation are also significant contributors to the market growth. Cloud-based non-relational databases offer numerous advantages, including reduced infrastructure costs, scalability, and ease of access. As more organizations migrate their operations to the cloud, the demand for cloud-based non-relational databases is set to rise. Moreover, the availability of Database-as-a-Service (DBaaS) offerings from major cloud providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) is simplifying the deployment and management of these databases, further driving their adoption.
Regionally, North America holds the largest market share, driven by the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The rapid digitalization, growing adoption of cloud services, and increasing investments in IT infrastructure in countries like China and India are propelling the demand for non-relational databases in the region. Additionally, the expanding e-commerce sector and the proliferation of smart devices are further boosting market growth in Asia Pacific.
The non-relational databases market is segmented into several types, including Document-Oriented Databases, Key-Value Stores, Column-Family Stores, Graph Databases, and Others. Each type offers unique functionalities and caters to specific use cases, making them suitable for different industry requirements. Document-Oriented Databases, such as MongoDB and CouchDB, store data in document format (e.g., JSON or BSON), allowing for flexible schema designs and efficient data retrieval. These databases are widely used in content management systems, e-commerce platforms, and real-time analytics applications due to their ability to handle semi-structured data.
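As a brief illustration of the flexible, schema-light document model described above, the following Python sketch stores and retrieves a JSON-like document with MongoDB's official driver; it assumes a locally running MongoDB instance and uses example names only:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
products = client['shop']['products']  # database and collection are created on first use
products.insert_one({'sku': 'A-100',
                     'name': 'Widget',
                     'attributes': {'color': 'red', 'sizes': ['S', 'M']}})
print(products.find_one({'attributes.color': 'red'}))  # query a nested field directly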
Key-Value Stores, such as Redis and Amazon DynamoDB, store data as key-value pairs, providing extremely fast read and write operations. These databases are ideal for caching, session management, and real-time applications where speed is critical. They offer horizontal scalability and are highly efficient in managing large volumes of data with simple query requirements. The simplicity of the key-value data model and its performance benefits make it a popular choice for high-throughput applications.
Column-Family Stores, such as Apache Cassandra and HBase, store data in columns rather than rows, allowing for efficient storage and retrieval of large datasets. These databases are designed to handle massive amounts of data across distributed systems, making them suitable for use cases involving big data analytics and time-series data.
The data set includes two types of discharge data: 1) observed daily discharge values compiled in the State Hydrological Institute, Russia from official sources and 2) modeled "naturalized" daily discharge. The "naturalized" discharge means discharge values with excluded human impact. The data can be used in hydro-climatological analysis to understand interactions between climate and hydrology. A specially developed Hydrograph Transformation Model (HTM) was used to eliminate effects of reservoirs and other human impact from discharge records. These data are formatted as text documents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The directory "splits" contains the corpus split into the train/dev/test sets used for training the relation extraction system.
RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/
The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:
gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv
Input documents for large-scale execution: the large-scale run covers all of PubMed (as of March 2024) and the PMC Open Access articles (as of November 2023) in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).
Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz
Tagger output: we filter the results of the tagger run down to gene/protein hits, and documents with more than 1 hit (since we are doing relation extraction) before feeding it to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
Relation extraction models. The Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz
The pre-trained RoBERTa model on PubMed and PMC and MIMIC-III with a BPE Vocab learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system is available here.
Relation extraction system output: the tab-delimited outputs of the relation extraction system are found at large_scale_relation_extraction_results.tar.gz !!!ATTENTION this file is approximately 1TB in size, so make sure you have enough space to download it on your machine!!!
The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
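Given the column layout described above (three identifier columns followed by per-class scores, with a header row in each file), individual output files could be inspected with pandas roughly as follows; the file name and the 0.5 threshold are arbitrary examples, not part of the released data:

import pandas as pd

scores = pd.read_csv('relations_part.tsv', sep='\t')         # hypothetical file name
score_columns = scores.columns[3:]                           # everything after PMID and the two BRAT IDs
confident = scores[scores[score_columns].max(axis=1) > 0.5]  # keep rows with at least one high class score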
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided data include all parameter files, binary output files, and log files that are the basis for the publication in MNRAS.
The file 'file_listing.txt' contains a complete list of files and directories in all gzipped tar files. Each individual gzipped tar file is formatted as follows:
Mm.m_Ll.ll_Ttttt_CtOc.cc.tar.gz
where
m.m :: the assumed mass of the model, in solar masses
l.ll :: the assumed luminosity, in log10(solar luminosities)
tttt :: the effective temperature of the star, in Kelvin
c.cc :: the carbon-to-oxygen excess, in log10(n_C/n_H - n_O/n_H) + 12
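Since the archive names follow this fixed pattern, the model parameters can be recovered from a file name alone. A small Python sketch under that assumption (the function, the regular expression, and the example name 'M1.0_L3.85_T2800_CtO8.20.tar.gz' are ours, for illustration only):

import re

NAME_RE = re.compile(r'M(?P<mass>[\d.]+)_L(?P<logL>[\d.]+)_T(?P<Teff>\d+)_CtO(?P<C_excess>[\d.]+)\.tar\.gz')

def parse_model_name(filename):
    # Returns the model parameters encoded in the archive name, or None if the name does not match
    match = NAME_RE.fullmatch(filename)
    return {key: float(value) for key, value in match.groupdict().items()} if match else None

print(parse_model_name('M1.0_L3.85_T2800_CtO8.20.tar.gz'))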
The contents vary according to the model, but here is the general directory structure:
nodr/ :: non-drift / PC models
drift/ :: drift models
nodr/init, drift/init :: Initial model files created using John Connor.
File suffixes are the following:
.par :: Plain-text parameter file that contains all parameters that are different from the respective default value in the model. Consequently, to see all used parameters it is necessary to look in the log file (see below).
.bin :: Binary file that contains converged models. Each model is stored in two versions, first the previous time step and then the current time step (both are needed to restart model calculations at that time step). The initial model file only contains one model, where the previous time step data are the same as the current time step data. The format of this file is explained below. Note! These files can get pretty large and are therefore only available for a smaller number of the models here. Please ask the corresponding author for the missing files should the need appear.
.log :: Plain-text log file that shows the used model parameters and a number of key properties for each converged model. The encoding of this file is UTF-8.
.inf :: Plain-text secondary log file that contains the header of the [primary] log file as well as timing information.
.tpb :: Secondary binary file that contains a number of properties specified at the outer boundary, typically for each consecutive time step.
.lis :: Plain-text file with the iteration history. Available for some files.
.liv :: Plain-text file with values specified for a number of properties at each gridpoint. Available for a smaller number of files.
.inp :: Plain-text file that is used to launch a model; some are still there.
.eps :: Encapsulated PostScript files created by John Connor when calculating the initial model.
Model evolution structure - file endings before the suffix:
_rlx :: Files related to relaxing the T-800 calculations on the initial model created by John Connor.
_exp :: Files related to expanding the initially compact model to the full radial domain.
_fix :: Files related to the intermediate stage where calculations are changed from expansion to outflow.
_out :: Files related to the outflow stage of the calculations; this is what you want to look at to see the wind evolution. Results in the paper are calculated using these data.
Note! Some outflow stage calculations continue the evolution of the previous set of files. The underlying reason for continued calculations is typically that the calculated time interval is too short. Such files are typically given the extension '_cont.lin_out', '_cont2.lin_out', etc.
Load files:
Two tools are provided here that can load the binary data files using the Interactive Data Language (IDL):
sc_load_bin (for files with the suffix '.bin'): Loads the full content of a T-800 binary file and returns a structure with the data.

sc_load_tpb (for files with the suffix '.tpb'): Loads the full content of a T-800 'tpb' binary file and returns a structure with the data. Note! Due to the way models run on clusters, this file is sometimes incomplete; this happens when the model code T-800 is stopped as the cluster-specific walltime is reached. If this is the case, it is necessary to use the binary file instead, where data are typically saved every 20th time step.
Alternative tools for use with Python and Julia could be considered, but were not yet available when this dataset was made public. Please contact the corresponding author for the current status on this issue.
https://www.archivemarketresearch.com/privacy-policy
Market Size and Growth: The global legal document checking and formatting software market is projected to reach a value of X million by 2033, expanding at a CAGR of XX% from 2025 to 2033. This surge is attributed to the increasing need for accuracy, efficiency, and consistency in legal document preparation. The increasing adoption of cloud-based solutions and the rise of artificial intelligence (AI) and machine learning (ML) technologies are further driving market growth.

Market Trends and Restraints: Key trends shaping the market include the adoption of cloud-based platforms for enhanced accessibility and collaboration, the use of AI and ML for automated document analysis and formatting, and the growing demand for industry-specific solutions. Market restraints include the cost of implementation, data security concerns, and the need for skilled professionals to manage and interpret the software's output. The market is also segmented by type (on-premise and cloud-based) and application (large enterprises and SMEs), with large enterprises holding a dominant market share due to their need for robust document management systems.