Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for HQ-EDIT
HQ-Edit, a high-quality instruction-based image editing dataset with total 197,350 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. HQ-Edit’s high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing… See the full description on the dataset page: https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit-data-demo.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by AIDemos
Released under MIT
This is an auto-generated index table corresponding to a folder of files in this dataset with the same name. This table can be used to extract a subset of files based on their metadata, which can then be used for further analysis. You can view the contents of specific files by navigating to the "cells" tab and clicking on an individual file_kd.
This dataset provides a curated subset of the anonymized Google Analytics event data for three months of the Google Merchandise Store. The full dataset is available as a BigQuery Public Dataset.
The data includes information on items sold in the store and how much money was spent by users over time. It is both comprehensive enough to invite real analysis yet simple enough to facilitate teaching.
Foto von Arthur Osipyan auf Unsplash
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Abstract MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.
Background In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.
MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.
The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.
Methods The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.
This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.
Data Description MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.
The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm"
would be stored in the CSV as "she said ""the patient was notified at 6pm"""
. More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
Usage Notes The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.
CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database using the CSV files. SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number software tools.
DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/
Release Notes Release notes for the demo follow the release notes for the MIMIC-III database.
Acknowledgements This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.
Conflicts of Interest The authors declare no competing financial interests.
References Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This data-set includes information about a sample of 8,887 of Open Educational Resources (OERs) from SkillsCommons website. It contains title, description, URL, type, availability date, issued date, subjects, and the availability of following metadata: level, time_required to finish, and accessibility.
This data-set has been used to build a metadata scoring and quality prediction model for OERs.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
kleberbaum/mesagona-demo-data dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset is a relational dataset of 8,000 households households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observation were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to result in the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Deskripsi data sample
SC Census Tracts - Esri Demo Data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
_p_stRT/_I_STRT/_S_03_stCuSTr/_A_02.8_DIAMOND.....
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
_p_stRT/_I_STRT/_S_04_stPanTr/_A_04-MSA...........
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created by faizan dola
Released under CC BY-NC-SA 4.0
It contains the following files:
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution, that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scaled, spatial analysis, operated through a user-friendly and interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenocyclerFusion platform. The dataset can be used to test the workflow and establish it on a user’s system or to familiarize oneself with the pipeline. Methods Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores each with cores of 1mm diameter. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor-TMA block was sectioned at 3µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described. CODEX multiplexed imaging and processing To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained. Image stitching, drift compensation, deconvolution, and cycle concatenation are performed within the Akoya PhenoCycler software. The raw imaging data output (tiff, 377.442nm per pixel for 20x CODEX) is first examined with QuPath software (https://qupath.github.io/) for inspection of staining quality. Any markers that produce unexpected patterns or low signal-to-noise ratios should be excluded from the ensuing analysis. The qptiff files must be converted into tiff files for input into SPACEc. Data preprocessing includes image stitching, drift compensation, deconvolution, and cycle concatenation performed using the Akoya Phenocycler software. The raw imaging data (qptiff, 377.442 nm/pixel for 20x CODEX) files from the Akoya PhenoCycler technology were first examined with QuPath software (https://qupath.github.io/) to inspect staining qualities. Markers with untenable patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were converted into tiff files for tissue detection (watershed algorithm) and cell segmentation.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Johnyquest7/OctoTools-Gradio-Demo-User-Data dataset hosted on Hugging Face and contributed by the HF Datasets community
This data represents the age of clients served via digital literacy classes and workshops.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
_p_stRT/_I_STRT/_S_03_stCuSTr/_A_02.6_TransRate...
AutoTrain Dataset for project: demo-on-token-classification
Dataset Description
This dataset has been automatically processed by AutoTrain for project demo-on-token-classification.
Languages
The BCP-47 code for the dataset's language is unk.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "tokens": [ "I", "will", "be", "traveling", "to", "Tokyo"… See the full description on the dataset page: https://huggingface.co/datasets/PhaniManda/autotrain-data-demo-on-token-classification.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
🇩🇪 독일
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for HQ-EDIT
HQ-Edit, a high-quality instruction-based image editing dataset with total 197,350 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. HQ-Edit’s high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing… See the full description on the dataset page: https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit-data-demo.