The data sets provide the text and detailed numeric information in all financial statements and their notes extracted from exhibits to corporate financial reports filed with the Commission using eXtensible Business Reporting Language (XBRL).
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
We created a dataset of clinical action items annotated over MIMIC-III. This dataset, which we call CLIP, is annotated by physicians and covers 718 discharge summaries, representing 107,494 sentences. Annotations were collected as character-level spans over discharge summaries after applying surrogate generation to fill in the anonymized templates from MIMIC-III text with fake data. We release these spans, their aggregation into sentence-level labels, and the sentence tokenizer used to aggregate the spans and label sentences. We also release the surrogate data generator, and the document IDs used for training, validation, and test splits, to enable reproduction. The spans are annotated with 0 or more labels of 7 different types, representing the different actions that may need to be taken: Appointment, Lab, Procedure, Medication, Imaging, Patient Instructions, and Other. We encourage the community to use this dataset to develop methods for automatically extracting clinical action items from discharge summaries.
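A minimal sketch, not the released code, of how character-level spans might be aggregated into sentence-level labels; the function and variable names are illustrative only:

```python
LABELS = ["Appointment", "Lab", "Procedure", "Medication",
          "Imaging", "Patient Instructions", "Other"]

def sentence_labels(sentence_offsets, spans):
    """sentence_offsets: list of (start, end) character offsets per sentence;
    spans: list of (start, end, label) character-level annotations."""
    labels = [set() for _ in sentence_offsets]
    for span_start, span_end, label in spans:
        for i, (start, end) in enumerate(sentence_offsets):
            # A sentence receives a label if it overlaps the span at all.
            if span_start < end and span_end > start:
                labels[i].add(label)
    return labels
```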
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FINGAP07 NUMBER OF FINANCIAL STATEMENTS AND NOTES TO ACCOUNTS PRODUCED
This dataset contains public medical text records (progress notes) written in Japanese. Researchers can use this dataset without privacy concerns. CC BY-NC 4.0
crowd.zip: 9,756 pseudo progress notes written by crowd workers
crowd_evaluated.zip: 83 pseudo progress notes of authentic quality written by crowd workers
MD.zip: 19 pseudo progress notes written by medical doctors
Reference: Kagawa, R., Baba, Y., & Tsurushima, H. (2021, December). A practical and universal framework for generating publicly available medical notes of authentic quality via the power of crowds. In 2021 IEEE International Conference on Big Data (Big Data) (pp. 3534-3543). IEEE. http://hdl.handle.net/2241/0002002333
The supplemental files of the paper are here: https://github.com/rinabouk/HMData2021
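A minimal sketch for inspecting the archives; that each zip holds one UTF-8 text note per file is an assumption for illustration:

```python
import zipfile

for archive in ["crowd.zip", "crowd_evaluated.zip", "MD.zip"]:
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        print(f"{archive}: {len(names)} files")
        # Read the first note as UTF-8 text (Japanese content).
        print(zf.read(names[0]).decode("utf-8")[:200])
```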
Financial statements: Notes to Financial Statements
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data includes information regarding session notes from sixty-three institutions, including blank session note forms, data sets of completed session notes, and survey data about how session notes are conceived of, and used, in writing centers.
Meeting notes from Interagency Data Team meetings. These are a best attempt to capture notable comments and questions from attendees; notes are paraphrased. Please reference the presentation or contact open.data@dc.gov with questions. The Interagency Data Team is a community of data analysts, or agency liaisons, who convene regularly with representation from DC agencies of all persuasions. Participants engage in discussions regarding the team’s core mission and priorities for a better kind of data culture – collection, application, sharing, classification and governance, to name a few. The team is coordinated by the Office of the Chief Technology Officer (OCTO), led by the Chief Data Officer (CDO), and directly supports the District of Columbia's Data Policy. Related presentations: Role of the DC State Data Center; Enterprise Dataset Inventory, Lessons from Pilot Agencies; Enterprise Dataset Inventory, A General Counsel’s Perspective.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 331,794 deidentified discharge summaries from 145,915 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,321,355 deidentified radiology reports for 237,427 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
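A minimal sketch of that linkage, assuming the released CSV layout (discharge.csv.gz keyed by subject_id and hadm_id); adjust paths to your local copies:

```python
import pandas as pd

notes = pd.read_csv("mimic-iv-note/discharge.csv.gz")
admissions = pd.read_csv("mimic-iv/hosp/admissions.csv.gz")

# Link each discharge summary to its hospital admission for clinical context.
linked = notes.merge(admissions, on=["subject_id", "hadm_id"], how="left")
print(linked[["subject_id", "hadm_id", "admittime", "text"]].head())
```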
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Turkish Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.
Dataset Content & Diversity: Containing more than 2,000 images, this Turkish OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we limit each contributor's handwriting to fewer than three unique images. This ensures the dataset covers diverse handwriting styles for training your OCR model. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Turkish text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these sticky notes were written, and the images captured, by native Turkish speakers to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Turkish text recognition models.
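As an illustration, a minimal sketch for exploring that metadata; the exact CSV column names (orientation, device) and file name are assumptions, not the published schema:

```python
import pandas as pd

meta = pd.read_csv("turkish_sticky_notes_metadata.csv")
# Images are named to match rows in the metadata file.
print(meta["orientation"].value_counts())    # portrait vs. landscape balance
print(meta["device"].value_counts().head())  # capture-device distribution
```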
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native Turkish crowd community.
If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Turkish language. Your journey to improved language understanding and processing begins here.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
iEEG and EEG data from 5 centers are organized in our study, with a total of 100 subjects. We publish the datasets from 4 of the centers here due to data-sharing restrictions.
Acquisitions include ECoG and SEEG. Each run is a different snapshot of EEG data from that subject's session. For seizure sessions, this means that each run is an EEG snapshot around a different seizure event.
For additional clinical metadata about each subject, refer to the clinical Excel table in the publication.
NIH, JHH, UMMC, and UMF agreed to share their data; Cleveland Clinic did not, so its data requires an additional data use agreement (DUA).
All data except Cleveland Clinic's were approved by their centers to be de-identified and shared. All data in this dataset contain no PHI or other identifiers associated with the patients. To access the Cleveland Clinic data, please forward all requests to Amber Sours, SOURSA@ccf.org:
Amber Sours, MPH Research Supervisor | Epilepsy Center Cleveland Clinic | 9500 Euclid Ave. S3-399 | Cleveland, OH 44195 (216) 444-8638
You will need to sign a data use agreement (DUA).
For each subject, there was a raw EDF file, which was converted into the BrainVision format with mne_bids.
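A minimal sketch of such a conversion with MNE-BIDS; the file name and BIDS entities are placeholders, not the actual source files:

```python
import mne
from mne_bids import BIDSPath, write_raw_bids

raw = mne.io.read_raw_edf("sub01_seizure01.edf", preload=True)
# Mark channels as intracranial so they match the ieeg datatype.
raw.set_channel_types({ch: "seeg" for ch in raw.ch_names})

bids_path = BIDSPath(subject="01", session="presurgery", task="ictal",
                     run="01", datatype="ieeg", root="bids_dataset")
# format="BrainVision" writes .vhdr/.vmrk/.eeg instead of copying the EDF;
# converting a preloaded Raw requires allow_preload=True.
write_raw_bids(raw, bids_path, format="BrainVision",
               allow_preload=True, overwrite=True)
```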
Each subject with an SEEG implantation also has an Excel table, called electrode_layout.xlsx, which outlines where the clinicians marked each electrode anatomically. Note that no rigorous atlas was applied, so the main labels of interest are: WM, GM, VENTRICLE, CSF, and OUT, which represent white matter, gray matter, ventricle, cerebrospinal fluid, and outside the brain, respectively. Channels labeled WM, VENTRICLE, CSF, and OUT were removed from further analysis; they are marked in the corresponding BIDS channels.tsv sidecar file as status=bad.
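A minimal sketch of propagating those anatomical labels into the channels.tsv sidecar; the Excel column names here (name, label) are assumptions for illustration:

```python
import pandas as pd

layout = pd.read_excel("electrode_layout.xlsx")  # channel -> anatomical label
channels = pd.read_csv("sub-01_ses-01_channels.tsv", sep="\t")

bad_labels = {"WM", "VENTRICLE", "CSF", "OUT"}
bad_names = set(layout.loc[layout["label"].isin(bad_labels), "name"])

# Flag non-brain channels so downstream tools exclude them.
channels.loc[channels["name"].isin(bad_names), "status"] = "bad"
channels.to_csv("sub-01_ses-01_channels.tsv", sep="\t", index=False)
```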
The dataset uploaded to openneuro.org does not contain the sourcedata directory, since an extra anonymization step occurred when fully converting to BIDS.
Derivatives include:
* fragility analysis
* frequency analysis
* graph metrics analysis
* figures
These can be computed by following the paper: Neural Fragility as an EEG Marker of the Seizure Onset Zone.
Within each EDF file, there are event markers annotated by clinicians, which may inform you of specific clinical events occurring in time, or of when they saw seizure onset and offset (clinical and electrographic).
During a seizure event, the event markers may follow this time course:
* eeg onset, or clinical onset - the onset of a seizure that is either marked electrographically, or by clinical behavior. Note that the clinical onset may not always be present, since some seizures manifest without clinical behavioral changes.
* Marker/Mark On - these are annotations, present in some cases, where a health practitioner injects a chemical marker for use in ICTAL SPECT imaging after a seizure occurs. This is commonly done to see which portions of the brain are metabolically active.
* Marker/Mark Off - This is when the ICTAL SPECT stops imaging.
* eeg offset, or clinical offset - this is the offset of the seizure, as determined either electrographically, or by clinical symptoms.
Other events included may help you understand the time course of each seizure. Note that ICTAL SPECT occurs in all Cleveland Clinic data. Note also that seizure markers are not consistent in their description naming, so one might encode specific regular-expression rules to consistently capture seizure onset/offset markers across all datasets (a sketch follows below). In the case of UMMC data, all onset and offset markers were provided by the clinicians in an Excel sheet instead of via the EDF file, so we added the annotations manually to each EDF file.
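A minimal sketch of such regex rules; the marker strings are illustrative, not an inventory of the actual annotations:

```python
import re

ONSET_RE = re.compile(r"(eeg|clin(ical)?)[\s_-]*onset|sz[\s_-]*onset",
                      re.IGNORECASE)
OFFSET_RE = re.compile(r"(eeg|clin(ical)?)[\s_-]*offset|sz[\s_-]*(end|offset)",
                       re.IGNORECASE)

def classify_marker(description):
    """Map a free-text annotation to 'onset'/'offset', else None."""
    if ONSET_RE.search(description):
        return "onset"
    if OFFSET_RE.search(description):
        return "offset"
    return None
```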
Seizures are present in various recordings; generally there is only one seizure per EDF file. When seizures are present, they are marked electrographically (and clinically, if present) via standard approaches in the epilepsy clinical workflow.
Clinical onset is simply the manifestation of the seizure as clinical symptoms; sometimes this marker may not be present.
What is actually important in evaluating these datasets is the clinical annotation of the localization hypotheses for the seizure onset zone.
These generally include:
* early onset: the earliest electrodes that clinicians saw participating in the seizure onset
* early/late spread (optional): the electrodes that showed epileptic spread activity after seizure onset. Not all seizures have spread contacts annotated.
For patients with a post-surgical MRI available, the segmentation process outlined above tells us which electrodes were within the surgically removed brain region.
Otherwise, clinicians give us their best estimate of which electrodes were resected/ablated, based on their surgical notes.
For surgical patients whose postoperative medical records did not explicitly indicate specific resected or ablated contacts, manual visual inspection was performed to determine the approximate contacts located in later-resected/ablated tissue. Postoperative T1 MRI scans were compared against post-SEEG-implantation CT scans, or against CURRY coregistrations of preoperative MRI and post-SEEG CT scans. Contacts of interest in and around the area of the reported resection were selected individually, and the corresponding slice was navigated to on the CT scan or CURRY coregistration. After identifying landmarks in that slice (e.g. skull shape and features, the shape of prominent brain structures like the ventricles, central sulcus, or superior temporal gyrus), the location of the contact relative to these landmarks, and the location of the slice along the axial plane, the corresponding slice in the postoperative MRI scan was navigated to. The resected tissue within that slice was then visually inspected and compared against the distinct landmarks identified in the CT scans; if brain tissue was not present in the corresponding location of the contact, the contact was marked as resected/ablated. This process was repeated for each contact of interest.
Adam Li, Chester Huynh, Zachary Fitzgerald, Iahn Cajigas, Damian Brusko, Jonathan Jagid, Angel Claudio, Andres Kanner, Jennifer Hopp, Stephanie Chen, Jennifer Haagensen, Emily Johnson, William Anderson, Nathan Crone, Sara Inati, Kareem Zaghloul, Juan Bulacio, Jorge Gonzalez-Martinez, Sridevi V. Sarma. Neural Fragility as an EEG Marker of the Seizure Onset Zone. bioRxiv 862797; doi: https://doi.org/10.1101/862797
Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896). https://doi.org/10.21105/joss.01896
Holdgraf, C., Appelhoff, S., Bickel, S., Bouchard, K., D'Ambrosio, S., David, O., … Hermes, D. (2019). iEEG-BIDS, extending the Brain Imaging Data Structure specification to human intracranial electrophysiology. Scientific Data, 6, 102. https://doi.org/10.1038/s41597-019-0105-7
Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8
https://www.usa.gov/government-works/
This dataset is from the SEC's Financial Statements and Notes Data Set.
It was a personal project to see if I could make the queries efficient.
It's just been collecting dust ever since, maybe someone will make good use of it.
Data is up to about early-2024.
It doesn't differ from the source, other than it's compiled - so maybe you can try it out, then compile your own (with the link below).
Dataset was created using SEC Files and SQL Server on Docker.
For details on the SQL Server database this came from, see: "dataset-previous-life-info" folder, which will contain:
- Row Counts
- Primary/Foreign Keys
- SQL Statements to recreate database tables
- Example queries on how to join the data tables (a sketch follows below the list).
- A pretty picture of the table associations.
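If you'd rather preview joins before standing up SQL Server, here is a minimal pandas sketch under the assumption of the standard SEC file layout (tab-delimited sub.txt and num.txt, joined on the adsh accession number):

```python
import pandas as pd

sub = pd.read_csv("sub.txt", sep="\t", low_memory=False)  # one row per filing
num = pd.read_csv("num.txt", sep="\t", low_memory=False)  # numeric facts

# Join numeric facts to their filing metadata via the accession number.
facts = num.merge(sub[["adsh", "name", "form", "period"]], on="adsh")
print(facts.head())
```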
Source: https://www.sec.gov/data-research/financial-statement-notes-data-sets
Happy coding!
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Liabilities: Notes in Circulation: Federal Reserve Notes in Actual Circulation (LNCFRNC) from 1914-11-20 to 2018-04-11 about actual, notes, liabilities, and USA.
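A minimal sketch for pulling this series programmatically with the community fredapi package (an API key from FRED is required; the package is not affiliated with this dataset):

```python
from fredapi import Fred

fred = Fred(api_key="YOUR_API_KEY")
series = fred.get_series("LNCFRNC")  # Federal Reserve Notes in Actual Circulation
print(series.tail())
```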
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MeSH-CZ-2025-notes - training dataset
Czech translation of Medical Subject Headings, version 2025. Download more MeSH-CZ data at nlk.cz.
License
MeSH-CZ-2025 - training dataset © 2025 by National Medical Library is licensed under Creative Commons Attribution 4.0 International
Structure
"text1","text2","value","category" "term1","definition/note","0.5","cat1|cat2"
category - multiple values (codes) separated by a pipe… See the full description on the dataset page: https://huggingface.co/datasets/NLK-NML/MeSH-CZ-2025-notes.
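A minimal sketch for loading pairs in this structure; the local file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("mesh_cz_2025_notes.csv")       # hypothetical local file name
df["category"] = df["category"].str.split("|")   # pipe-separated codes -> list
print(df[["text1", "text2", "value", "category"]].head())
```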
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Current Effective Date: 0901Z 17 Apr 2025 to 0901Z 12 Jun 2025. Note data provides additional information for Enroute chart production. It is provided in geospatial vector file formats and is depicted on Enroute charts. Note data information is published every eight weeks by the U.S. Department of Transportation, Federal Aviation Administration, Aeronautical Information Services.
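A minimal sketch for reading such a vector file with GeoPandas; the file name and format are placeholders for whatever the release actually ships:

```python
import geopandas as gpd

notes = gpd.read_file("enroute_notes.shp")  # placeholder file name
print(notes.crs)      # coordinate reference system
print(notes.head())   # note attributes and geometries
```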
https://pacific-data.sprep.org/resource/private-data-license-agreement-0
Project Idea Notes based on the developed SoE and NEMS
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classifying free-text from historical databases into research-compatible formats is a barrier for clinicians undertaking audit and research projects. The aim of this study was to (a) develop an interactive active machine-learning model training methodology using readily available software that was (b) easily adaptable to a wide range of natural language databases and allowed customised researcher-defined categories, and then (c) evaluate the accuracy and speed of this model for classifying free text from two unique and unrelated sets of clinical notes into coded data. A user interface for medical experts to train and evaluate the algorithm was created. The data requiring coding took the form of two independent databases of free-text clinical notes, each with a unique natural language structure. Medical experts defined categories relevant to research projects and performed ‘label-train-evaluate’ loops on the training data set (a sketch of one such loop follows below). A separate dataset was used for validation, with the medical experts blinded to the label given by the algorithm. The first dataset was 32,034 death certificate records from Northern Territory Births Deaths and Marriages, coded into 3 categories: haemorrhagic stroke, ischaemic stroke or no stroke. The second dataset was 12,039 recorded episodes of aeromedical retrieval from two prehospital and retrieval services in the Northern Territory, Australia, coded into 5 categories: medical, surgical, trauma, obstetric or psychiatric. For the first dataset, macro-accuracy of the algorithm was 94.7%; for the second, macro-accuracy was 92.4%. The time taken to develop and train the algorithm was 124 minutes for the death certificate coding and 144 minutes for the aeromedical retrieval coding. This machine-learning training method was able to classify free-text clinical notes quickly and accurately from two different health datasets into categories of relevance to clinicians undertaking health service research.
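A minimal sketch of one ‘label-train-evaluate’ loop iteration using scikit-learn, not the authors' software; the seed notes and categories are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["retrieval for chest pain and dyspnoea",
                 "fractured femur after motor vehicle accident"]
labels = ["medical", "trauma"]  # researcher-defined categories
unlabeled_texts = ["query appendicitis, transferred for surgery",
                   "34 weeks pregnant, contractions in flight"]

# Fit the vocabulary on all notes so labeled and unlabeled share features.
vectorizer = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
model = LogisticRegression(max_iter=1000).fit(
    vectorizer.transform(labeled_texts), labels)

# Surface the least-confident unlabeled notes for the expert to label next;
# newly labeled notes join the training pool and the loop repeats.
confidence = model.predict_proba(vectorizer.transform(unlabeled_texts)).max(axis=1)
for i in np.argsort(confidence):
    print(f"label me next ({confidence[i]:.2f}): {unlabeled_texts[i]}")
```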
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Thai Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Thai language.
Dataset Content & Diversity: Containing more than 2,000 images, this Thai OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we limit each contributor's handwriting to fewer than three unique images. This ensures the dataset covers diverse handwriting styles for training your OCR model. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Thai text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these sticky notes were written, and the images captured, by native Thai speakers to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats (see the sketch below for reading HEIC files).
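Since HEIC is not universally supported, a minimal sketch for reading those files via the pillow-heif package; the file name is a placeholder:

```python
from PIL import Image
import pillow_heif

# Register a Pillow opener so Image.open can read .heic files.
pillow_heif.register_heif_opener()

img = Image.open("thai_sticky_note_0001.heic").convert("RGB")
img.save("thai_sticky_note_0001.jpg")  # convert for OCR pipelines
```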
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Thai text recognition models.
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native Thai crowd community.
If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Thai language. Your journey to improved language understanding and processing begins here.
Meeting notes from the Interagency Data Team meeting. The Interagency Data Team is a community of data analysts, or agency liaisons, who convene regularly with representation from DC agencies of all persuasions. Participants engage in discussions regarding the team’s core mission and priorities for a better kind of data culture – collection, application, sharing, classification and governance, to name a few. The team is coordinated by the Office of the Chief Technology Officer (OCTO), led by the Chief Data Officer (CDO), and directly supports the District of Columbia's Data Policy. Related items: Office of the Chief Technology Officer Data Team FY 2023 Projects; 2023 Enterprise Dataset Inventory is now OpenData Report.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DSH COVID-19 Patient Data reports on patient positives and testing counts at the facility level for DSH. The table reports on the following data fields:
Total patients that tested positive for COVID-19 since 5/16/2020
Patients newly positive for COVID-19 in the last 14 days
Patient deaths while patient was positive for COVID-19 since 5/30/2020
Total number of tests administered since 3/23/2020
COVID-19 test results for patients include DSH patients who are tested while receiving treatment at an outside medical facility. Data have been de-identified in accordance with CalHHS Data De-identification Guidelines: counts between 1 and 10 are masked as "<11" (see the sketch below). Testing counts include Patients Under Investigation (PUIs) testing and proactive testing of asymptomatic patients for surveillance of geriatric, medically fragile, and skilled nursing facility units, and for patients upon admission, re-admission, or discharge. Deaths include all individuals who were positive for COVID-19 at the time of death, regardless of underlying health conditions or whether the cause of death has been confirmed to be COVID-19-related illness. Metro-Norwalk is additional COVID-19 surge space and technically a branch location that is part of DSH Metropolitan Hospital.
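A minimal sketch of that small-count masking rule:

```python
def mask_count(n):
    """Report counts from 1 to 10 as '<11', per the de-identification rule."""
    return "<11" if 1 <= n <= 10 else str(n)

print([mask_count(n) for n in [0, 3, 10, 11, 42]])
# ['0', '<11', '<11', '11', '42']
```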
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
The outbreak of Coronavirus (COVID-19) in the last quarter of 2019-20 has led to unprecedented changes in the work and behaviour of GP practices, and consequently the data in this publication may have been impacted. As such, caution should be taken in drawing any conclusions from this data without due consideration of the circumstances both locally and nationally, and we would recommend that any use of this data is accompanied by an appropriate caveat.

The Statement of Fitness for Work (the Med3 form or 'fit note') was introduced in April 2010 across England, Wales and Scotland. It enables healthcare professionals to give advice to their patients about the impact of their health condition on their fitness for work, and is used to provide medical evidence for employers or to support a claim to health-related benefits through the Department for Work and Pensions (DWP). A fit note is issued after the first seven days of sickness absence (when patients can self-certify) if the healthcare professional assesses that the patient’s health affects their fitness for work. The healthcare professional can decide the patient is 'unfit for work' or 'may be fit for work subject to the following advice...' with accompanying notes on suggested adjustments or adaptations to the job role or workplace.

In 2012, DWP funded a project to provide general practices with the ability to produce computer-generated fit notes (eMed3), and this included the capability to collect the aggregated data generated. Fit notes are issued to patients by doctors, nurses, physiotherapists, occupational therapists and pharmacists following an assessment of their fitness for work. While they can be written by hand, most fit notes provided by general practice are now computer-generated.

This quarterly statistical publication is produced by NHS England in collaboration with The Work and Health Unit, jointly sponsored by the Department for Work and Pensions and the Department of Health. It presents data on electronic fit notes issued in general practices in England for a given period. This is a ‘cumulative’ data collection: weekly data collected will continue to be added to existing data, and all data for all reporting periods is updated in each quarterly publication. From April 2019 all publications contain data from practices that have TPP as their system supplier (which was not previously available), accounting for one third of practices in England; consequently, publications from this date may not be comparable to previous publications. All GP practices are mapped using current NHS geographies, and recent changes may have resulted in a small number of practices not being mapped historically. These are shown as 'Unallocated' but are included in the England total. NHS England publishes data on a quarterly basis in October, January, April and July.

11/07/2024: the summary Excel file, table 13, was updated with missing text; no data was changed or impacted.