Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark interest rate in the United States was last recorded at 4.50 percent. This dataset provides the latest reported value for - United States Fed Funds Rate - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark interest rate in New Zealand was last recorded at 3 percent. This dataset provides - New Zealand Interest Rate - actual values, historical data, forecast, chart, statistics, economic calendar and news.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Key information about New Zealand Long Term Interest Rate
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark interest rate in Australia was last recorded at 3.60 percent. This dataset provides - Australia Interest Rate - actual values, historical data, forecast, chart, statistics, economic calendar and news.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A biodiversity dataset graph: Biodiversity Heritage Library
Biodiversity datasets, or descriptions of biodiversity datasets, are increasingly available through open digital data infrastructures such as the Biodiversity Heritage Library (BHL, https://biodiversitylibrary.org). "The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community." - https://biodiversitylibrary.org , June 2019.
However, little is known about how these networks, and the data accessed through them, change over time. This dataset provides snapshots of all OCR item texts (e.g., individual items) available through BHL as tracked by Preston (https://github.com/bio-guoda/preston , https://doi.org/10.5281/zenodo.1410543 ) over the period May - June 2019.
This snapshot contains about 120GB of uncompressed OCR texts across 227k OCR BHL items. Also, a snapshot of the BHL item catalog at https://www.biodiversitylibrary.org/data/item.txt is included.
The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index files and two provenance files are included, and they have also been published individually in this dataset publication. Index files provide a way to link provenance files in time to establish a versioning mechanism. Provenance files describe how, when and where the BHL OCR text items were retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543.
To retrieve and verify the downloaded BHL biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder. After that, verify the index of the archive by reproducing the following result:
$ java -jar preston.jar history
<0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion>
To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" looks like the lines shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.
$ java -jar preston.jar verify
hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca file:/home/preston/preston-bhl/data/e0/c1/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca OK CONTENT_PRESENT_VALID_HASH 49458087
hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 file:/home/preston/preston-bhl/data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 OK CONTENT_PRESENT_VALID_HASH 25745
hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c file:/home/preston/preston-bhl/data/85/ef/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c OK CONTENT_PRESENT_VALID_HASH 519892
Note that a copy of the Java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or in short "preston".
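The verification output shown above can also be checked programmatically. The following is a minimal sketch, not part of Preston itself: it assumes each output line has five whitespace-separated fields (content identifier, file location, status, state, size in bytes) and that file paths contain no spaces. The function names are hypothetical, and the path layout is inferred only from the sample lines.

```python
VALID_STATE = "CONTENT_PRESENT_VALID_HASH"

# One sample line copied from the verify output shown above.
SAMPLE_LINE = (
    "hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 "
    "file:/home/preston/preston-bhl/data/1a/57/"
    "1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 "
    "OK CONTENT_PRESENT_VALID_HASH 25745"
)


def parse_verify_line(line):
    """Split one verify output line into its five fields (assumed layout)."""
    content_id, location, status, state, size = line.split()
    return {
        "content_id": content_id,
        "location": location,
        "status": status,
        "state": state,
        "size": int(size),
    }


def expected_relative_path(content_id):
    """Path layout seen in the sample: data/<hex 0-1>/<hex 2-3>/<full hash>."""
    digest = content_id.rsplit("/", 1)[-1]
    return f"data/{digest[:2]}/{digest[2:4]}/{digest}"


def failed_lines(lines):
    """Return parsed records that do not report OK with a valid hash."""
    records = [parse_verify_line(line) for line in lines if line.strip()]
    return [
        record
        for record in records
        if record["status"] != "OK" or record["state"] != VALID_STATE
    ]
```

An empty list from `failed_lines` means every line passed. To use it, save the output of "java -jar preston.jar verify" to a text file and pass its lines to the function.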
Files in this data publication:
README - this file
preston-[00-ff].tar.gz - preston archives containing BHL OCR item texts, their provenance and a provenance index.
9e8c86243df39dd4fe82a3f814710eccf73aa9291d050415408e346fa2b09e70 - preston index file
2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a - preston index file
89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a - preston provenance file
41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4 - preston provenance file
This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Civil Rights Data Collection (CRDC), formerly administered as the Elementary and Secondary School Civil Rights Survey, is an important part of the U.S. Department of Education's (Department) Office for Civil Rights (OCR) strategy for administering and enforcing civil rights laws in the nation’s public school districts and schools. The CRDC collects a variety of information including student access to rigorous courses, programs, resources, instructional and other school staff, and school climate factors such as student discipline and harassment and bullying. Much of the data is disaggregated by race/ethnicity, sex, disability and whether students are English Learners.
Since the 2011–12 school year, OCR has collected data from all public districts and their schools in the 50 states and Washington, DC. Over time the CRDC’s collection universe has grown to include long-term secure justice facilities, charter schools, alternative schools, and special education schools that focus primarily on serving students with disabilities. OCR added the Commonwealth of Puerto Rico to the CRDC, beginning with the 2017-18 CRDC. From 1968 to 2010, civil rights data were collected from a sample of public districts and their schools, except for the 1976 and 2000 collections, which included data from all public schools and districts.
The purpose of the CRDC Archival Download Tool (Archival Tool) is to make the Department’s civil rights data from 1968 to 1998 publicly available. The Archival Tool organizes civil rights data by year, and provides users with access to the data, survey forms, and other relevant documentation. The tool also includes documentation on key historical CRDC data changes from 1968 to 1998. Users may extract district-level civil rights data.
Important Consideration: Past collections and publicly released reports may contain some terms that readers may consider obsolete, offensive and/or inappropriate. As part of the Department’s goal to be open and transparent with the public, we are providing access to all civil rights data in its original format.
Privacy notice: The Department of Education’s Disclosure Review Board determined that the CRDC files for 1968-1998 are safe for public “re-release” under the Family Educational Rights and Privacy Act (FERPA) (20 U.S.C. § 1232g; 34 CFR Part 99).
Since 1968, OCR has collected civil rights data related to students' access and barriers to educational opportunity from early childhood through grade 12. These data are collected from all public schools and districts, as well as long-term secure juvenile justice facilities, charter schools, alternative schools, and special education schools that focus primarily on serving the educational needs of students with disabilities under IDEA or section 504 of the Rehabilitation Act. The CRDC collects information about student enrollment; access to courses, programs and school staff; and school climate factors, such as bullying, harassment and student discipline. Most data collected by the CRDC are disaggregated by race, ethnicity, sex, disability, and English Learner status. Originally known as the Elementary and Secondary School Civil Rights Survey, OCR began by collecting data every year from 1968 to 1974 from a sample of school districts and their schools. Over time, the schedule and approach to data collection have changed. Since the 2011-12 collection, the CRDC has been administered every two years to all public school districts and schools in the 50 states and Washington, D.C., and OCR added the Commonwealth of Puerto Rico for the 2017-18 CRDC. Due to the COVID-19 pandemic that resulted in school closures nationwide, OCR postponed the 2019-20 CRDC and instead collected data from the 2020-21 school year.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.
The purpose of this dataset is to evaluate the performance of Optical Character Recognition (OCR) and Named Entity Recognition (NER) on 19th century French documents.
This dataset is divided into two parts:
A labeled dataset, which contains 8765 manually corrected entries from 78 pages (18 different directories), and which is designed for supervised training.
An unlabeled dataset, containing 1058196 raw entries from 6887 pages (13 different directories), and which is designed for self-supervised pre-training.
For the labeled dataset, we provide:
Original pages and cropped images
Human-corrected positions, transcriptions and entity tagging for each entry
OCR prediction from 3 systems (Tesseract v4, PERO OCR v2020 and Kraken)
Projected NER reference from clean text to OCR predictions, making it suitable to evaluate the performance of NER systems on real, noisy OCR predictions
For the unlabeled dataset, we provide:
Automatically detected positions for each entry (these are very noisy)
OCR predictions for each entry (PERO OCR engine)
How to cite this dataset
Please cite this dataset as:
N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.
@dataset{abadie_dataset_22, author = {Abadie, Nathalie and Baciocchi, St{\'e}phane and Carlinet, Edwin and Chazalon, Joseph and Cristofoli, Pascal and Dum{\'e}nieu, Bertrand and Perret, Julien}, title = {{A} {D}ataset of {F}rench {T}rade {D}irectories from the 19th {C}entury ({FTD})}, month = mar, year = 2022, publisher = {Zenodo}, version = {v1.0.0}, doi = {10.5281/zenodo.6394464}, url = {https://doi.org/10.5281/zenodo.6394464} }
You may also be interested in our paper presented at DAS 2022 (15th IAPR International Workshop on Document Analysis Systems), which compares the performance of OCR and NER systems on this dataset:
N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu, A Benchmark of Named Entity Recognition Approaches in Historical Documents — Application to 19th Century French Directories, May 2022, La Rochelle, France, Springer.
@inproceedings{abadie_das_22, author = {Abadie, Nathalie and Carlinet, Edwin and Chazalon, Joseph and Dum{\'e}nieu, Bertrand}, title = {{A} {B}enchmark of {N}amed {E}ntity {R}ecognition {A}pproaches in {H}istorical {D}ocuments — {A}pplication to 19th {C}entury {F}rench {D}irectories}, month = may, year = 2022, publisher = {Springer}, place = {La Rochelle, France} }
Copyright and License
The images were extracted from the original source https://gallica.bnf.fr, owned by the Bibliothèque nationale de France (French national library).
Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.
Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.
Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.
The original contents were significantly transformed before being included in this dataset. All derived content is licensed under the permissive Creative Commons Attribution 4.0 International license.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Several lifestyle factors promote protection against Alzheimer's disease (AD) throughout a person's lifespan. Although such protective effects have been described for occupational cognitive requirements (OCR) in midlife, it is currently unknown whether they are conveyed by brain maintenance (BM), brain reserve (BR), or cognitive reserve (CR) or a combination of them.
Methods
We systematically derived hypotheses for these resilience concepts and tested them in the population-based AgeCoDe cohort and memory clinic-based AD high-risk DELCODE study. The OCR score (OCRS) was measured using job activities based on the O*NET occupational classification system. Four sets of analyses were conducted: (1) the interaction of OCR and APOE-ε4 with regard to cognitive decline (N = 2,369, AgeCoDe), (2) association with differentially shaped retrospective trajectories before the onset of dementia of the Alzheimer's type (DAT; N = 474, AgeCoDe), (3) cross-sectional interaction of the OCR and cerebrospinal fluid (CSF) AD biomarkers and brain structural measures regarding memory function (N = 873, DELCODE), and (4) cross-sectional and longitudinal association of OCR with CSF AD biomarkers and brain structural measures (N = 873, DELCODE).
Results
Regarding (1), higher OCRS was associated with a reduced association of APOE-ε4 with cognitive decline (mean follow-up = 6.03 years), consistent with CR and BR. Regarding (2), high OCRS was associated with a later onset but subsequently stronger cognitive decline in individuals converting to DAT, consistent with CR. Regarding (3), higher OCRS was associated with a weaker association of the CSF Aβ42/40 ratio and hippocampal volume with memory function, consistent with CR. Regarding (4), OCR was not associated with the levels or changes in CSF AD biomarkers (mean follow-up = 2.61 years). We found a cross-sectional, age-independent association of OCRS with some MRI markers, but no association with 1-year-change. OCR was not associated with the intracranial volume. These results are not completely consistent with those of BR or BM.
Discussion
Our results support the link between OCR and CR. Promoting and seeking complex and stimulating work conditions in midlife could therefore contribute to increased resistance to pathologies in old age and might complement prevention measures aimed at reducing pathology.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Abstract
Objective: To assess, over 3 years of follow-up, the effects of maintaining or switching to ocrelizumab (OCR) therapy on clinical and MRI outcomes and safety measures in the open-label extension (OLE) phase of the pooled OPERA studies in relapsing multiple sclerosis.
Methods: After 2 years of double-blind, controlled treatment, patients continued OCR (600 mg infusions every 24 weeks) or switched from interferon (IFN) β-1a (44 μg 3 times weekly) to OCR when entering the OLE phase (3 years). Adjusted annualized relapse rate, time to onset of 24-week confirmed disability progression/improvement (CDP/CDI), brain MRI activity (gadolinium-enhanced and new/enlarging T2 lesions), and percentage brain volume change were analyzed.
Results: Of patients entering the OLE phase, 88.6% completed Year 5. The cumulative proportion with 24-week CDP was lower in patients who initiated OCR earlier, vs patients initially receiving IFN β-1a (16.1% vs 21.3% at Year 5; p=0.014). Patients continuing OCR maintained, and those switching from IFN β-1a to OCR attained, near complete and sustained suppression of new brain MRI lesion activity from Year 3 to 5. Over the OLE phase, patients continuing OCR exhibited less whole brain volume loss from double-blind study baseline vs those switching from IFN β-1a (–1.87% vs –2.15% at Year 5; p<0.01). Adverse events were consistent with past reports and no new safety signals emerged with prolonged treatment.
Conclusion: Compared with patients switching from IFN β-1a, earlier and continuous OCR treatment up to 5 years provided sustained benefit on clinical and MRI measures of disease progression.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset contents
This dataset is an OCR text corpus of 2,984 printed works (monographs and serials) from the collection of the Qatar National Library. The works are mostly in Arabic, but fragments of text in other languages can also be found. Besides the OCR text, the basic descriptive metadata for each item is also provided.
Release note for version 2 of the dataset
The dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. Furthermore, every item has undergone a visual quality assurance check conducted using a representative sample of pages. This update has resulted in a significant enhancement of word-level accuracy across the entire dataset, ensuring higher reliability and usability.
The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).
Dataset structure
The dataset consists of three files:
QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata of 2,894 items from the Qatar National Library collection. Both files have the same content and are structured into the following columns:
- CALL #(ITEM) - Item call number in the QNL catalog
- RECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)
- Repository URL - URL to digitized item content in the QNL repository
- Catalog URL - URL to the complete item metadata record in the QNL catalog
- AUTHOR - Main author information for the item
- ADD AUTHOR - Additional author information for the item
- PUB INFO - Item publication info
- TITLE - Item title
- DESCRIPTION - Item description
- VOLUME - Item volume information (in case of some serial publications)
QNL_ArabicOCR_Corpus-v2.zip contains:
- 2,894 text files with the following naming pattern: [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with a related metadata record. Each file contains text extracted from a particular item using OCR software.
- checksums.sha256 - contains SHA256 checksums for all 2,894 text files
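The checksum comparison and the file-to-metadata matching described above could be sketched as follows. This is an illustration under stated assumptions: the layout of checksums.sha256 is assumed to follow the common `sha256sum` style (hex digest, whitespace, file name), which the description does not confirm, and the record number is assumed to contain no hyphen. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_checksums(checksum_file):
    """Read '<hex digest>  <file name>' lines (assumed sha256sum-style layout)."""
    expected = {}
    text = Path(checksum_file).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' before the name.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def mismatched_files(corpus_dir, checksum_file):
    """Return names of listed files that are missing or whose digest differs."""
    corpus_dir = Path(corpus_dir)
    problems = []
    for name, digest in read_checksums(checksum_file).items():
        target = corpus_dir / name
        if not target.is_file() or sha256_of(target) != digest:
            problems.append(name)
    return sorted(problems)


def record_number(file_name):
    """Take the part before the first hyphen as the item record number.

    Assumes the record number itself contains no hyphen.
    """
    return Path(file_name).stem.partition("-")[0]
```

Comparing the dictionaries returned by `read_checksums` for version 1 and version 2 would give the list of files that changed between the two releases.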
The Reserve Bank of Australia's (RBA) cash rate target in part determines interest rates on financial products.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corpus for the ICDAR2019 Competition on Post-OCR Text Correction (October 2019)
Christophe Rigaud, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux
http://l3i.univ-larochelle.fr/ICDAR2019PostOCR
-------------------------------------------------------------------------------
These are the supplementary materials for the ICDAR 2019 paper "ICDAR 2019 Competition on Post-OCR Text Correction".
Please use the following citation:
@inproceedings{rigaud2019pocr, title="ICDAR 2019 Competition on Post-OCR Text Correction", author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe}, year={2019}, booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}}
Description: The corpus accounts for 22M OCRed characters along with the corresponding Gold Standard (GS). The documents come from different digital collections available, among others, at the National Library of France (BnF) and the British Library (BL). The corresponding GS comes both from BnF's internal projects and external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus and Wikisource.
Repartition of the dataset
- ICDAR2019_Post_OCR_correction_training_18M.zip: 80% of the full dataset, provided to train participants' methods.
- ICDAR2019_Post_OCR_correction_evaluation_4M: 20% of the full dataset used for the evaluation (with Gold Standard made publicly available after the competition).
- ICDAR2019_Post_OCR_correction_full_22M: full dataset made publicly available after the competition.
Special case for Finnish language
Material from the National Library of Finland (Finnish dataset FI > FI1) may not be re-shared on other websites. Please follow these guidelines to get and format the data from the original website.
1. Go to https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en;
2. Download OCR Ground Truth Pages (Finnish Fraktur) [v1](4.8GB) from Digitalia (2015-17) package;
3. Convert the Excel file "~/metadata/nlf_ocr_gt_tescomb5_2017.xlsx" to Comma Separated Format (.csv) by using the "save as" function in spreadsheet software (e.g. Excel, Calc) and copy it into "FI/FI1/HOWTO_get_data/input/";
4. Go to "FI/FI1/HOWTO_get_data/" and run "script_1.py" to generate the full "FI1" dataset in "output/full/";
5. Run "script_2.py" to split the "output/full/" dataset into "output/training/" and "output/evaluation/" subsets.
At the end of the process, you should have a "training", "evaluation" and "full" folder with 1579528, 380817 and 1960345 characters respectively.
Licenses: free to use for non-commercial uses, according to sources in details
- BG1: IMPACT - National Library of Bulgaria: CC BY NC ND
- CZ1: IMPACT - National Library of the Czech Republic: CC BY NC SA
- DE1: Front pages of Swiss newspaper NZZ: Creative Commons Attribution 4.0 International (https://zenodo.org/record/3333627)
- DE2: IMPACT - German National Library: CC BY NC ND
- DE3: GT4Hist-dta19 dataset: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE4: GT4Hist - EarlyModernLatin: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE5: GT4Hist - Kallimachos: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE6: GT4Hist - RefCorpus-ENHG-Incunabula: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- DE7: GT4Hist - RIDGES-Fraktur: CC-BY-SA 4.0 (https://zenodo.org/record/1344132)
- EN1: IMPACT - British Library: CC BY NC SA 3.0
- ES1: IMPACT - National Library of Spain: CC BY NC SA
- FI1: National Library of Finland: no re-sharing allowed, follow the above section to get the data. (https://digi.kansalliskirjasto.fi/opendata)
- FR1: HIMANIS Project: CC0 (https://www.himanis.org)
- FR2: IMPACT - National Library of France: CC BY NC SA 3.0
- FR3: RECEIPT dataset: CC0 (http://findit.univ-lr.fr)
- NL1: IMPACT - National library of the Netherlands: CC BY
- PL1: IMPACT - National Library of Poland: CC BY
- SL1: IMPACT - Slovak National Library: CC BY NC
Text post-processing such as cleaning and alignment has been applied to the resources mentioned above, so that the Gold Standard and the OCRs provided are not necessarily identical to the originals.
Structure
- **Content** [./lang_type/sub_folder/#.txt]
    - "[OCR_toInput] " => Raw OCRed text to be de-noised.
    - "[OCR_aligned] " => Aligned OCRed text.
    - "[ GS_aligned] " => Aligned Gold Standard text.
The aligned OCRed/GS texts are provided for training and test purposes. The alignment was made at the character level using "@" symbols. "#" symbols correspond to the absence of GS either related to alignment uncertainties or related to unreadable characters in the source document. For a better view of the alignment, make sure to disable the "word wrap" option in your text editor.
The Error Rate and the quality of the alignment vary according to the nature and the state of degradation of the source documents. Periodicals (mostly historical newspapers) for example, due to their complex layout and their original fonts, have been reported to be especially challenging. In addition, it should be mentioned that the quality of the Gold Standard also varies, as the dataset aggregates resources from different projects that have their own annotation procedures, and it inevitably contains some errors.
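The file layout and alignment symbols described above can be read with a few lines of code. The sketch below is an assumption-laden illustration, not the competition's official evaluation: it assumes each of the three fields sits on its own line starting with the prefixes quoted above, and it computes a simple per-position mismatch rate that skips positions where the Gold Standard is "#". The function names are hypothetical.

```python
# Line prefixes quoted in the dataset description, mapped to field names.
PREFIXES = {
    "[OCR_toInput] ": "ocr_input",
    "[OCR_aligned] ": "ocr_aligned",
    "[ GS_aligned] ": "gs_aligned",
}


def parse_aligned(text):
    """Extract the three fields from the content of one corpus file."""
    fields = {}
    for line in text.splitlines():
        for prefix, name in PREFIXES.items():
            if line.startswith(prefix):
                fields[name] = line[len(prefix):]
    return fields


def aligned_error_rate(ocr_aligned, gs_aligned):
    """Share of aligned positions where OCR and Gold Standard differ.

    Positions where the Gold Standard is '#' (no GS available) are skipped.
    '@' padding is compared like any other character, so a character that is
    missing or extra in the OCR counts as one error.
    """
    if len(ocr_aligned) != len(gs_aligned):
        raise ValueError("aligned strings must have the same length")
    compared = 0
    errors = 0
    for ocr_char, gs_char in zip(ocr_aligned, gs_aligned):
        if gs_char == "#":
            continue
        compared += 1
        if ocr_char != gs_char:
            errors += 1
    return errors / compared if compared else 0.0


# Made-up example in the described format (not taken from the corpus).
EXAMPLE = (
    "[OCR_toInput] Th cat sot\n"
    "[OCR_aligned] Th@ cat sot\n"
    "[ GS_aligned] The cat sat\n"
)
```

For the made-up example, two of the eleven aligned positions differ (the missing "e" and the "o" read for "a").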
ICDAR2019 competition
Information related to the tasks, formats and evaluation metrics is detailed at: https://sites.google.com/view/icdar2019-postcorrectionocr/evaluation
References
- IMPACT, European Commission's 7th Framework Program, grant agreement 215064
- Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018). Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin.
- https://digi.nationallibrary.fi , Wiipuri, 31.12.1904, Digital Collections of National Library of Finland
- EU Horizon 2020 research and innovation programme grant agreement No 770299
Contact
- christophe.rigaud(at)univ-lr.fr
- antoine.doucet(at)univ-lr.fr
- mickael.coustaty(at)univ-lr.fr
- jean-philippe.moreux(at)bnf.fr
L3i - University of la Rochelle, http://l3i.univ-larochelle.fr
BnF - French National Library, http://www.bnf.fr
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Civil Rights Data Collection (CRDC), formerly administered as the Elementary and Secondary School Civil Rights Survey, is an important part of the U.S. Department of Education's (Department) Office for Civil Rights (OCR) strategy for administering and enforcing civil rights laws in the nation’s public school districts and schools. The CRDC collects a variety of information including student access to rigorous courses, programs, resources, instructional and other school staff, and school climate factors such as student discipline and harassment and bullying. Much of the data is disaggregated by race/ethnicity, sex, disability and whether students are English Learners.
Since the 2011–12 school year, OCR has collected data from all public districts and their schools in the 50 states and Washington, DC. Over time the CRDC’s collection universe has grown to include long-term secure justice facilities, charter schools, alternative schools, and special education schools that focus primarily on serving students with disabilities. OCR added the Commonwealth of Puerto Rico to the CRDC, beginning with the 2017-18 CRDC. From 1968 to 2010, civil rights data were collected from a sample of public districts and their schools, except for the 1976 and 2000 collections, which included data from all public schools and districts.
The purpose of the CRDC Archival Download Tool (Archival Tool) is to make the Department’s civil rights data from 1968 to 1998 publicly available. The Archival Tool organizes civil rights data by year, and provides users with access to the data, survey forms, and other relevant documentation. The tool also includes documentation on key historical CRDC data changes from 1968 to 1998. Users may extract district-level civil rights data. For instructions and information on using the Archival Data Download Tool, please view this page.
Important Consideration: Past collections and publicly released reports may contain some terms that readers may consider obsolete, offensive and/or inappropriate. As part of the Department’s goal to be open and transparent with the public, we are providing access to all civil rights data in its original format.
Privacy notice: The Department of Education’s Disclosure Review Board determined that the CRDC files for 1968-1998 are safe for public “re-release” under the Family Educational Rights and Privacy Act (FERPA) (20 U.S.C. § 1232g; 34 CFR Part 99).
Data were collected via a sample of school districts and all individual schools within these districts.
Related Projects:
CRDC 2000: https://www.datalumos.org/datalumos/project/218422/view
CRDC 2004: https://www.datalumos.org/datalumos/project/218423/view
CRDC 2006: https://www.datalumos.org/datalumos/project/218424/view
CRDC 2009-2010: https://www.datalumos.org/datalumos/project/218425/view
CRDC 2013-2014: https://www.datalumos.org/datalumos/project/100445/view
CRDC 2015 - 2016: https://www.datalumos.org/datalumos/project/103004/view
This dataset consists of 156 pages of Romanian texts written in the Romanian Transitional Script (RTS). RTS is a mix of Latin and Cyrillic characters that were used in the 19th century in the Romanian provinces to facilitate the transition from the Romanian Cyrillic Script to the modern Latin Script. The images cover the period between 1833 and 1864. The selected texts cover a diverse range of literary genres, including poems, novels, dramas, stories, newspapers, and religious texts.
The dataset was obtained from the Central University Libraries (BCU) of Timișoara, Iași, and Cluj-Napoca through their free online platforms or by request. The scanned images are provided in JPEG and PNG formats, with dimensions ranging from approximately 300 by 900 pixels to 2000 by 3000 pixels. The file sizes vary between 70 KB and 10 MB.
To ensure diversity, the dataset includes images with various fonts, styles, regions, publishers, and years. It covers the key publishing regions of all three main Romanian provinces (Bucharest - B, Iasi - IS, Brasov - BV, Sibiu - SB, Blaj - BJ) as well as some publishers located outside Romania that printed texts in RTS (Vienna - V, Budapest - BD, Paris - P). It comprises 4,588 lines of text, totaling 31,132 words and 158,656 characters. Among these characters, there are 61,065 Cyrillic characters, 27,022 Latin characters, 53,844 overlapping characters (identical symbols), and 16,725 other characters (e.g., punctuation, digits). The images below summarize its content per publisher and decade. More statistics (including per publishing house and per character) are available in the code provided.
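As a quick consistency check on the figures quoted above, the four character categories can be summed and turned into shares of the total. This is only a worked arithmetic sketch using the numbers from the description; the variable names are illustrative.

```python
# Character counts per category, as quoted in the dataset description.
counts = {
    "Cyrillic": 61065,
    "Latin": 27022,
    "overlapping": 53844,
    "other": 16725,
}

# The categories should add up to the stated total of 158,656 characters.
total = sum(counts.values())

# Share of each category, in percent of all characters.
shares = {name: 100 * count / total for name, count in counts.items()}

for name, share in shares.items():
    print(f"{name:12s} {counts[name]:7d} {share:5.1f}%")
```

The four categories do add up to the stated total, and Cyrillic characters account for roughly 38.5% of it.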
Statistics of characters in the dataset per publisher and decade*
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15661653%2F13bd86216df169b5c4783813a4b5118f%2Fchar-count.png?generation=1687532923729343&alt=media
Percentage of Latin vs. Cyrillic vs. other characters in the dataset*
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15661653%2F0cfad1574aa2823b798fcf2b515beff6%2Fchar-ratio.png?generation=1687532980067286&alt=media
The dataset presents typical challenges found in old documents, such as wear and tear, blemishes, discolorations, library imprints, handwriting, ink smudges, and variations in text alignment. These factors may impact legibility, and some scanned lines of text may not be uniformly straight.
This dataset provides a valuable resource for researchers and practitioners interested in historical document analysis, transliteration techniques, and studying the evolution of the Romanian language. It allows for the development and evaluation of OCR models and other language processing techniques in the context of the Romanian Transitional Script. The images provided are accompanied by ground truth texts (.gt.txt files) containing the correct text found in them, as well as .box files for the Tesseract 5 OCR engine.
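To work with the images and their ground truth, each image first has to be matched with its transcription. The sketch below assumes the common Tesseract training convention that an image and its ground truth share a base name (for example page.png and page.gt.txt). The dataset description does not state the naming scheme, so treat this pairing rule and the function name as assumptions.

```python
from pathlib import Path

# Image formats mentioned in the dataset description (JPEG and PNG).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_with_ground_truth(folder):
    """Return (pairs, missing) for all images found directly in `folder`.

    pairs   - list of (image path, ground truth text) tuples
    missing - list of image paths without a matching .gt.txt file
    """
    folder = Path(folder)
    pairs = []
    missing = []
    for image in sorted(folder.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Assumed convention: <base name>.gt.txt sits next to the image.
        ground_truth = image.with_name(image.stem + ".gt.txt")
        if ground_truth.is_file():
            text = ground_truth.read_text(encoding="utf-8").strip()
            pairs.append((image, text))
        else:
            missing.append(image)
    return pairs, missing
```

The returned pairs can then be fed to an OCR engine for evaluation, comparing each prediction with the ground truth text.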
You may use the dataset freely as long as you mention this page or the project below.
This work was supported by a grant of the Romanian Ministry of Research, Innovation and Digitization, CCCDI - UEFISCDI, project number PN-III-P2-2.1-PED-2021-0693, within PNCDI III. Project website: ROTLA
*Plots are based on the original dataset distribution
https://cdla.io/sharing-1-0/
The Aida Calculus Math Handwriting Recognition Dataset consists of 100,000 images in 10 batches. Each image contains a photo of a handwritten calculus math expression (specifically within the topic of limits) written with a dark utensil on plain paper. Each image is accompanied by ground truth math expression in LaTeX as well as bounding boxes and pixel-level masks per character. All images are synthetically generated.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F67bf0c680286baf2c979c8207a991bb2%2FScreen%20Shot%202020-08-19%20at%201.02.50%20PM.png?generation=1597868629120369&alt=media
The complexity of handwriting recognition for math expressions can be decomposed into the following sources of variability:
Image of Math = Math Expression x Math Characters x Location of Math Characters x Visual Qualities of the Math Characters (fonts, color) x Noise of Image (backgrounds, stray marks)
It is the job of the recognition model to take the Image of Math as input and predict the Math Expression.
Typical approaches to handwriting recognition tasks involve collecting and tagging large amounts of data, on which many iterations of models are trained. The "one dataset, many models" paradigm has specific drawbacks within the context of product development. As product requirements evolve, such as the addition of a new mathematical character into the prediction space, a new data collection and tagging effort must be undertaken. The cycle of adapting the handwriting recognition capability to new requirements is long and does not support agile product development.
Here, we take a different approach by iteratively building a complex, synthetically generated dataset towards specific requirements. The generation process gives the developer exact control over the distribution of math expressions, characters, location of characters, specific visual qualities of the math, image noise, and image augmentations. The developer controls every aspect of the data, down to each pixel. In many ways, the data synthesis runs backwards to the handwriting recognition model, creating visual complexity that the model must then untangle to uncover the ground truth math expression. Thus, we arrive at a "many datasets, one model" paradigm in which, as product requirements change, the data can be quickly regenerated and adapted on agile cycles.
In addition to affording more control over the product development process, synthetic data allows for 100% correct pixel by pixel tagging that opens the door for new modeling possibilities. Every image is tagged with the ground truth LaTeX for the expressions, bounding boxes per math character, and exact pixel masks for each character.
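To illustrate how the per-character pixel masks and bounding boxes relate, here is a minimal sketch that derives a box from a binary mask. The on-disk annotation format is not described here, so the nested-list mask and the inclusive `(x_min, y_min, x_max, y_max)` convention are assumptions made for illustration.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if empty.

    `mask` is a list of rows of 0/1 values; coordinates are inclusive pixel indices.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


# A tiny hypothetical mask for a single character.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(mask)
print(bbox)  # (1, 1, 3, 2)
```

Because the masks are exact, a box computed this way should agree with the provided bounding box for the same character, which makes for a quick consistency check when loading the data.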
Our goal in releasing this dataset is to provide the data science and machine learning community with resources for undertaking the challenging computer vision task of extracting math expressions from images. The data offers something to all levels, from beginners building simple character recognition models to experts who wish to predict pixel-by-pixel masks and decode the complex structure of math expressions.
The images contain math expressions of limits, a topic typically encountered by students learning Calculus I in the United States. Features of the writing such as font, writing utensils (type, color, pressure, consistency), angle and distance of photo, and size of writing are all simulated. Background features include shadows, various plain paper types, bleed-through, other distortions, and noise typical of students taking photos of their math.
The strategy in defining the populations from which images are synthesized is to be a superset of what we expect students to submit. Therefore, the math expressions are not in themselves pedagogical, but aim to encompass the potential variety of student submissions, both mathematically correct and incorrect. The image features and augmentations are similarly designed to cover the range of possible student handwriting qualities.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F78c49b9673f8d07c91cd5c929e50ed13%2FPicture2.png?generation=1597361067979205&alt=media
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F38f70b6a773709eb02578f20634e8433%2FPicture1.png?generation=1597361068613807&alt=media
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2F17a3a78ac635cd728f9d6ef32609aee8%2FPicture3.png?generation=1597361068784034&alt=media
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F5602706%2Fc052749a8085d66aa7bf97c78a4b6c6a%2FPicture4.png?generation=1597361068949074&alt=media
Data consis...
Quarter 3 2021 (July to September) was a period in which the final stages of COVID-19 restrictions were lifted in England and Wales. Adapted vocational and other qualifications continued to be permitted, in line with our regulatory arrangements (please see the background notes for more information). The trends seen in this quarter may have been affected as a result.
The number of certificates awarded in 2021 quarter 3 was just over 2.2 million, a 12% increase from quarter 3 of 2020. As a result, the number of certificates issued in the 12 months leading up to the end of quarter 3 2021 (4,596,775) increased relative to the number of certificates issued in the 12 months leading up to the end of quarter 2 2021 (4,364,655). This confirms the change of trend already seen in quarter 2, when these figures started rising for the first time since the COVID-19 pandemic.
There was an increase in the number of certificates awarded for almost all qualification levels between quarter 3 of 2020 and quarter 3 of 2021, with the largest increase in numbers being for Level 3 qualifications (from 541,385 certificates in quarter 3 of 2020 to 687,555 certificates in quarter 3 of 2021, an increase of 27%). The only qualification level which saw a decrease in certificates awarded from quarter 3 2020 to quarter 3 2021 was Entry Level qualifications with a decrease of 4%. The number of certificates issued for Level 8 qualifications remained the same.
Quarter 3 of 2021 saw increases in the number of certificates awarded for most sector subject areas. The largest increase was seen for Arts, Media, and Publishing, which saw an increase from 193,820 certificates in quarter 3 of 2020 to 303,485 certificates in quarter 3 of 2021 (an increase of 57%). The largest decrease in certificates from quarter 3 of 2020 was seen for Preparation for Life and Work, dropping from 476,380 certificates to 366,550 in quarter 3 of 2021, a decrease of 23%.
Quarter 3 2021 saw a significant decrease compared to quarter 3 2020 in the number of certificates awarded for Functional Skills qualifications (from 255,955 certificates in quarter 3 2020 to 142,510 certificates in quarter 3 2021, a decrease of 44%). This may be, in part, due to ESFA permitting temporary flexibilities on when Functional Skills qualifications had to be taken as part of apprenticeships and only requiring apprentices to achieve Level 1. Most qualification types, however, saw an increase this quarter compared to 2020 quarter 3.
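The percentage changes quoted in the paragraphs above can be reproduced from the certificate counts. The following is a minimal sketch using only the figures stated in this release; the `percent_change` helper and the dictionary labels are illustrative, and this is not the method used to produce the official statistics.

```python
def percent_change(before, after):
    """Percentage change from `before` to `after` (negative means a decrease)."""
    return 100 * (after - before) / before


# (quarter 3 2020, quarter 3 2021) certificate counts as quoted in the text.
figures = {
    "Level 3 qualifications": (541_385, 687_555),
    "Arts, Media, and Publishing": (193_820, 303_485),
    "Preparation for Life and Work": (476_380, 366_550),
    "Functional Skills": (255_955, 142_510),
}

# Round to whole percentages, as the release does.
changes = {name: round(percent_change(b, a)) for name, (b, a) in figures.items()}
for name, change in changes.items():
    print(f"{name}: {change:+d}%")
```

Rounded to whole percentages, these come out at +27%, +57%, -23% and -44%, matching the figures reported above.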
The 2 qualifications with the highest number of certificates awarded in quarter 3 2021 were OCR Level 1/2 Cambridge National Certificate in Creative iMedia, followed by Pearson BTEC Level 1/Level 2 First Award in Sport. These qualifications also had the highest number of certificates awarded in quarter 3 2020.
The awarding organisation with the highest number of certificates issued in this quarter was Pearson, followed by City and Guilds and OCR. In quarter 3 2021, Pearson saw a 5% decrease in the number of certificates awarded (down 33,005) compared to quarter 3 2020. City and Guilds and OCR saw a 5% and 1% decrease in the number of certificates compared to quarter 3 2020 respectively.
Over the whole year, Pearson had the highest number of certificates issued, followed by City and Guilds and NCFE.
The dataset used to produce this release is available separately.
All our published vocational and other qualifications publications are available as part of the collection for vocational qualifications statistics.
We welcome your feedback on our publications. Should you have any comments on this statistical release and how to improve it to meet your needs please contact us at data.analytics@ofqual.gov.uk.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository contains the dataset of the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)" submitted to the Journal of Data Mining and Digital Humanities (JDMDH). I refer to the paper (https://doi.org/10.5281/zenodo.7387376) for the description of the corpus and the models.
The dataset is in version V2: it contains the allographetic AND graphematic transcriptions (files *.normalized.xml) and models.
Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions are produced using a Chocomuffin conversion table (see corpus/conversion_table.csv) to reduce each allograph to its corresponding grapheme. The abbreviations are not expanded.
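The reduction of allographs to graphemes can be pictured as a table lookup. The sketch below is not Chocomuffin itself and does not use its interface: it assumes a simple two-column CSV (allograph, grapheme), which may differ from the actual layout of corpus/conversion_table.csv, and the two sample rows (long s and r rotunda) are illustrative only.

```python
import csv
import io


def load_table(csv_file):
    """Read (allograph, grapheme) rows into a dict.

    The two-column layout is an assumption, not the documented file format.
    """
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) >= 2}


def to_graphematic(text, table):
    """Replace every allograph in `text` with its grapheme."""
    # Replace longer allographs first so multi-character entries are not split.
    for allograph in sorted(table, key=len, reverse=True):
        text = text.replace(allograph, table[allograph])
    return text


# Hypothetical sample rows standing in for the real conversion table.
sample_table = io.StringIO("ſ,s\nꝛ,r\n")
table = load_table(sample_table)
print(to_graphematic("ſeñoꝛ", table))  # prints "señor"
```

Abbreviation marks are left untouched by such a mapping, consistent with the note that abbreviations are not expanded.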
Please cite the following paper if you use this dataset or the models:
@article{gille_levenson_2023_towards,
  author = {Gille Levenson, Matthias},
  date = {2023},
  journaltitle = {Journal of Data Mining and Digital Humanities},
  doi = {10.46298/jdmdh.10416},
  editor = {Pinche, Ariane and Stokes, Peter},
  issuetitle = {Special Issue: Historical documents and automatic text recognition},
  title = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)},
}
GILLE LEVENSON, Matthias, « Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR) », Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.
The image of the manuscript M (Esc_M) has not yet been uploaded, pending permission from the library that keeps the manuscript.
All images are kept in a directory named after the place where the manuscript is kept, and the sigla of the witness for the in-domain dataset.
The global licence for the dataset (except for images) is CC-BY-NC-SA. All manuscript reproductions are published with the authorization of the libraries.
©Biblioteca General Histórica de Salamanca
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086
©Museo Lázaro Galdiano. Madrid
Inv. 15304, Fundación Lázaro Galdiano (A)
©Universidad de Valladolid
Ms. 251, Biblioteca Santa Cruz (S)
©Real Biblioteca del Escorial
Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)
Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published
Ms. Z-I-12
Ms. Z-III-9
Ms. X-III-4
Ms. h-III-9
Ms. b-IV-15
Ms. b-II-11
Ms. a-II-17
Ms. T-III-5
©Rosenbach Foundation
Ms. 482/2 (U)
© Gallica.bnf.fr
Espagnol 12
Espagnol 36
Espagnol 218
© Bodleian Library
Ms. Span. d. 1
Ms. Span. d. 2/1
© Biblioteca Real, Madrid
Ms. II/215 (G)
© Biblioteca Nacional de España
Mss/4183
Inc/901 (Z)
© Biblioteca Universitaria, Sevilla
Ms. 332/131 (R)
Edit: add result files
Information on the Civil Rights Data Collection (CRDC), specifically geared for use of the CRDC in the development of State and local report cards. Data on this website is derived directly from the national public-use data file the Office for Civil Rights (OCR) prepares of the CRDC data. OCR places a high priority on ensuring the accuracy of CRDC data. The data submission system includes a series of embedded edit checks to ensure significant data errors are corrected before a district submits its data. Additionally, each district is required to certify the accuracy of its submission. Although OCR performs these robust data quality checks, there are known data issues and concerns. More information on data quality can be found in the 2015-16 CRDC Data Notes (https://res1ocrdatad-o-tedd-o-tgov.vcapture.xyz/DataNotes)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pseudoxanthoma elasticum (PXE) is a genetic disease considered a paradigm of ectopic mineralization disorders, being characterized by multisystem clinical manifestations due to progressive calcification of skin, eyes, and the cardiovascular system, resembling an age-related phenotype. Although fibroblasts do not express the pathogenic ABCC6 gene, these cells are still under investigation because they regulate connective tissue homeostasis, generating the “arena” where cells and extracellular matrix components can promote pathologic calcification and where activation of pro-osteogenic factors can be associated with pathways involving mitochondrial metabolism. The aim of the present study was to integrate structural and bioenergetic features to deeply investigate mitochondria from control and from PXE fibroblasts cultured in standard conditions and to explore the role of mitochondria in the development of the PXE fibroblasts’ pathologic phenotype. Proteomic, biochemical, and morphological data provide new evidence that in basal culture conditions (1) the protein profile of PXE mitochondria reveals a number of differentially expressed proteins, suggesting changes in redox balance, oxidative phosphorylation, and calcium homeostasis in addition to modified structure and organization, (2) measurement of oxygen consumption indicates that the PXE mitochondria have a low ability to cope with a sudden increased need for ATP via oxidative phosphorylation, (3) mitochondrial membranes are highly polarized in PXE fibroblasts, and this condition contributes to increased reactive oxygen species levels, (4) ultrastructural alterations in PXE mitochondria are associated with functional changes, and (5) PXE fibroblasts exhibit a more abundant, branched, and interconnected mitochondrial network compared to control cells, indicating that fusion prevails over fission events. In summary, the present study demonstrates that mitochondria are modified in PXE fibroblasts.
Since mitochondria are key players in the development of the aging process, fibroblasts cultured from aged individuals or aged in vitro are more prone to calcify, and in PXE calcified tissues are reminiscent of features of premature aging syndromes, it can be hypothesized that mitochondria represent a common link contributing to the development of ectopic calcification in aging and in diseases. Therefore, ameliorating mitochondrial functions and cell metabolism could open new strategies to positively regulate a number of signaling pathways associated with pathologic calcification.