73 datasets found
  1. 330K+ Interior Design Images | AI Training Data | Annotated imagery data for...

    • datarade.ai
    Cite
    Data Seeds, 330K+ Interior Design Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/200k-interior-design-images-ai-training-data-annotated-i-data-seeds
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Data Seeds
    Area covered
    Indonesia, Curaçao, Ethiopia, Jamaica, Egypt, Congo, Tajikistan, Turks and Caicos Islands, Kuwait, Nicaragua
    Description

    This dataset features over 330,000 high-quality interior design images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly varied and extensively annotated collection of indoor environment visuals.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, making it ideal for tasks such as room classification, furniture detection, and spatial layout analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.

    2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions centered on interior design themes ensure a steady stream of fresh, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours to fulfill specific requests, such as particular room types, design styles, or furnishings.

    3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide spectrum of architectural styles, cultural aesthetics, and functional spaces. The images include homes, offices, restaurants, studios, and public interiors, ranging from minimalist and modern to classic and eclectic designs.

    4. High-Quality Imagery: the dataset includes standard to ultra-high-definition images that capture fine interior details. Both professionally staged and candid real-life spaces are included, offering versatility for training AI across design evaluation, object detection, and environmental understanding.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This provides valuable insights into global aesthetic trends, helping AI models learn user preferences, design appeal, and stylistic relevance.

    6. AI-Ready Design: the dataset is optimized for machine learning tasks such as interior scene recognition, style transfer, virtual staging, and layout generation. It integrates smoothly with popular AI development environments and tools.

    7. Licensing & Compliance: the dataset fully complies with data privacy regulations and includes transparent licensing suitable for commercial and academic use.

    Use Cases:

    1. Training AI for interior design recommendation engines and virtual staging tools.
    2. Enhancing smart home applications and spatial recognition systems.
    3. Powering AR/VR platforms for virtual tours, furniture placement, and room redesign.
    4. Supporting architectural visualization, decor style transfer, and real estate marketing.

    This dataset offers a comprehensive, high-quality resource tailored for AI-driven innovation in design, real estate, and spatial computing. Customizations are available upon request. Contact us to learn more!

  2. Annotated GMB Corpus

    • kaggle.com
    zip
    Updated Oct 7, 2018
    + more versions
    Cite
    Shoumik (2018). Annotated GMB Corpus [Dataset]. https://www.kaggle.com/shoumikgoswami/annotated-gmb-corpus
    Explore at:
    Available download formats: zip (473318 bytes)
    Dataset updated
    Oct 7, 2018
    Authors
    Shoumik
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Named entity recognition on an annotated corpus: the GMB (Groningen Meaning Bank) corpus prepared for entity classification, with enhanced and popular natural language processing features applied to the data set.

    Content

    The dataset is an extract from the GMB corpus which is tagged, annotated, and built specifically to train a classifier to predict named entities such as names, locations, etc. GMB is a fairly large corpus with a lot of annotations. Unfortunately, GMB is not perfect: it is not a gold-standard corpus, meaning that it is not completely human-annotated and it is not considered 100% correct. The corpus was created using existing annotators and then corrected by humans where needed. The attached dataset is in tab-separated format; the goal is to create a good model to classify the Tag column. The data is labelled using the IOB tagging system. The dataset contains the following classes:

    geo = Geographical Entity
    org = Organization
    per = Person
    gpe = Geopolitical Entity
    tim = Time indicator
    art = Artifact
    eve = Event
    nat = Natural Phenomenon
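    To make the IOB scheme concrete, here is a minimal sketch (not code from the corpus itself) that collapses (word, tag) pairs into entity spans; the sample sentence and its tags are invented for illustration:

```python
# Minimal sketch: collapsing IOB tags into entity spans.
# The (word, tag) pairs below are illustrative, not taken from the corpus.
def iob_to_spans(tokens):
    """Group (word, iob_tag) pairs into (entity_type, phrase) spans."""
    spans, current = [], None
    for word, tag in tokens:
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(word)
        else:  # "O" tag, or a stray I- tag without a matching B-
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

sentence = [("John", "B-per"), ("Smith", "I-per"), ("visited", "O"),
            ("New", "B-geo"), ("York", "I-geo"), ("in", "O"), ("May", "B-tim")]
print(iob_to_spans(sentence))
# → [('per', 'John Smith'), ('geo', 'New York'), ('tim', 'May')]
```

    The B-/I- prefixes are what lets two adjacent entities of the same type stay separate.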

    Acknowledgements

    The dataset is a subset of the original dataset shared here - https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/kernels

    Inspiration

    The data can be used by anyone who is starting off with NER in NLP.

  3. ERA-40 Monthly Means of Isentropic Level Analysis Data

    • data.ucar.edu
    • rda-web-prod.ucar.edu
    • +2more
    grib
    Updated Oct 9, 2025
    + more versions
    Cite
    European Centre for Medium-Range Weather Forecasts (2025). ERA-40 Monthly Means of Isentropic Level Analysis Data [Dataset]. http://doi.org/10.5065/84RB-5G30
    Explore at:
    Available download formats: grib
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    NSF National Center for Atmospheric Research
    Authors
    European Centre for Medium-Range Weather Forecasts
    Description

    This dataset contains the monthly means of the ECMWF ERA-40 reanalysis isentropic level analysis data.

  4. Data from: Dataset of annotated headword-synonym-distractor triplets SYNDIST...

    • live.european-language-grid.eu
    binary format
    Updated Nov 9, 2025
    Cite
    (2025). Dataset of annotated headword-synonym-distractor triplets SYNDIST [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/23910
    Explore at:
    Available download formats: binary format
    Dataset updated
    Nov 9, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to a synonym, which can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five.

    The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt: "You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning. The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form). The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.). The distractor must be a frequent word in the Slovene language. The distractor must look similar to the synonym but have a different meaning. Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym} The distractor cannot be one of these words: {synonym_set}."

    The manual evaluation of all the distractors (with the exception of the distractors that were identified as existing synonyms in the Thesaurus) was conducted by two lexicographers. Each of them evaluated their own part, with the second one also subsequently inspecting the evaluations of the first one. The estimate is that around 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics such as being too similar to the synonym, the word being too archaic or informal, etc.), same as synonym, and synonym candidate (likely a legitimate (new) synonym of the headword).

    The dataset also includes information on the frequency of the synonyms and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). The frequency information is provided for single-word lemmas only (and not for multiword items or non-lemma single-word forms such as plural forms of nouns or comparatives of adjectives). In addition, information on the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.
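    Gestalt pattern matching is the Ratcliff/Obershelp algorithm, which Python's standard difflib.SequenceMatcher implements; a minimal sketch of the kind of similarity score described (the word pair shown is illustrative, not taken from the dataset):

```python
from difflib import SequenceMatcher

def gestalt_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp ("Gestalt") similarity, a ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

# Illustrative synonym/distractor pair in the prompt's example format.
print(gestalt_similarity("vesel", "resen"))   # shared block "ese" → 2*3/10 = 0.6
print(gestalt_similarity("vesel", "vesel"))   # identical strings → 1.0
```

    The ratio is twice the number of matched characters divided by the combined length of both strings, so form-similar distractors score high even when meanings differ.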

  5. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Explore at:
    Available download formats: zip (11274 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    | Feature | Description | Range |
    |---|---|---|
    | 10 Features | Economic, environmental & social indicators | Realistically scaled |
    | 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
    | Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
    | No Missing Values | Clean, preprocessed data | Ready for analysis |
    | 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # Load and prepare
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)
    
    # Cluster
    kmeans = KMeans(n_clusters=5, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)
    
    # Analyze
    print(df.groupby('cluster').mean())
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:

    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    | Cluster | Characteristics | Example Cities |
    |---|---|---|
    | Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
    | Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
    | Developing Centers | Mid income, high density, poor air | Emerging markets |
    | Low-Income Suburban | Low infrastructure, income | Rural areas |
    | Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:

    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code
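    As a sketch of the silhouette validation mentioned above, the snippet below scores candidate cluster counts; synthetic make_blobs data stands in for the actual city_lifestyle_dataset.csv, whose columns are not reproduced here:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 points, 10 features, 5 planted clusters.
X, _ = make_blobs(n_samples=300, centers=5, n_features=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k by mean silhouette (higher = better separated).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

    On the real dataset, a silhouette peak around 4-5 clusters would be consistent with the "4-5 natural clusters" claim.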

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  6. Uniform Meaning Representation 2.1 (Czech and Latin)

    • live.european-language-grid.eu
    binary format
    Updated Jun 29, 2025
    Cite
    (2025). Uniform Meaning Representation 2.1 (Czech and Latin) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23896
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jun 29, 2025
    License

    https://lindat.mff.cuni.cz/repository/xmlui/page/license-umr-2.0

    Area covered
    Czechia
    Description

    Czech and Latin UMR data, both manually annotated and programmatically converted from manually annotated tectogrammatical data.

  7. GMB Data

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Ghassen Khaled (2025). GMB Data [Dataset]. https://www.kaggle.com/datasets/ghassenkhaled/gmb-data
    Explore at:
    Available download formats: zip (3265952 bytes)
    Dataset updated
    Jul 31, 2025
    Authors
    Ghassen Khaled
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    For this notebook, we're going to use the GMB (Groningen Meaning Bank) corpus for named entity recognition. GMB is a fairly large corpus with a lot of annotations. The data is labeled using the IOB format (short for inside, outside, beginning), which means each annotation also needs a prefix of I, O, or B.

    The following classes appear in the dataset:

    LOC - Geographical Entity
    ORG - Organization
    PER - Person
    GPE - Geopolitical Entity
    TIME - Time indicator
    ART - Artifact
    EVE - Event
    NAT - Natural Phenomenon

    Note: GMB is not completely human annotated, and it’s not considered 100% correct. For this exercise, classes ART, EVE, and NAT were combined into a MISC class due to the small number of examples for these classes.

  8. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781?locale-attribute=en
    Explore at:
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

    Each example contains the following data fields:

    • word: The target word with multiple meanings
    • sentence1: The first sentence containing the target word
    • sentence2: The second sentence containing the target word
    • idx: The index of the example in the dataset
    • label: Label showing if the sentences contain the same meaning of the target word
    • start1: Start of the target word in the first sentence
    • start2: Start of the target word in the second sentence
    • end1: End of the target word in the first sentence
    • end2: End of the target word in the second sentence
    • version: The version of the annotation
    • manual_annotation: Boolean showing if the label was manually annotated
    • group: The group of annotators that labelled the example
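    A minimal sketch of reading one record in this SuperGLUE-style layout; the record below is invented (English words, made-up offsets) to match the field list, not taken from SloWiC:

```python
import json

# Hypothetical record following the field list above (not a real SloWiC entry).
record = json.loads("""{
  "word": "bank",
  "sentence1": "She sat on the bank of the river.",
  "sentence2": "He deposited money at the bank.",
  "idx": 0, "label": false,
  "start1": 15, "end1": 19,
  "start2": 26, "end2": 30,
  "version": 1, "manual_annotation": true, "group": "A"
}""")

# The start/end offsets locate the target word inside each sentence.
w1 = record["sentence1"][record["start1"]:record["end1"]]
w2 = record["sentence2"][record["start2"]:record["end2"]]
print(w1, w2, record["label"])
# → bank bank False
```

    The character offsets matter because the surface form of the target word can vary between the two sentences in an inflected language like Slovenian.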

  9. ECMWF ERA5: ensemble means of surface level analysis parameter data

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Jul 7, 2025
    + more versions
    Cite
    European Centre for Medium-Range Weather Forecasts (ECMWF) (2025). ECMWF ERA5: ensemble means of surface level analysis parameter data [Dataset]. https://catalogue.ceda.ac.uk/uuid/d8021685264e43c7a0868396a5f582d0
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    European Centre for Medium-Range Weather Forecasts (ECMWF)
    License

    https://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdf

    Area covered
    Earth
    Variables measured
    cloud_area_fraction, sea_ice_area_fraction, air_pressure_at_mean_sea_level, lwe_thickness_of_atmosphere_mass_content_of_water_vapor
    Description

    This dataset contains ERA5 surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10-member ensemble, run at a reduced resolution compared with the single high-resolution (hourly output at 31 km grid spacing) 'HRES' realisation; these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record.

    Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation: it is calculated by dividing by 10 (N) rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data.
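    In NumPy terms, the N versus N-1 distinction is the ddof argument of std; a sketch with illustrative ensemble values (not real ERA5 data):

```python
import numpy as np

# Ten illustrative ensemble-member values at one grid point (e.g. 2 m temperature, K).
members = np.array([280.1, 280.4, 279.9, 280.2, 280.0,
                    280.3, 279.8, 280.5, 280.1, 280.2])

spread_population = members.std(ddof=0)  # divide by N = 10, as described here
spread_sample = members.std(ddof=1)      # divide by N - 1 = 9 (NOT what is used)

print(float(spread_population), float(spread_sample))
```

    The population form (ddof=0) is slightly smaller than the sample form for the same data, so users comparing spreads against their own calculations should match the divisor.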

    The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. It follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects.

    An initial release of ERA5 data (ERA5T) is made roughly 5 days behind the present date. These data are subsequently reviewed ahead of being released by ECMWF as quality-assured data within 3 months. CEDA holds a 6-month rolling copy of the latest ERA5T data; see related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, so new runs were performed to address this issue, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere", but users of data from this period should read technical memo 859 for further details.

  10. Model output and data used for analysis

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Model output and data used for analysis [Dataset]. https://catalog.data.gov/dataset/model-output-and-data-used-for-analysis
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The modeled data in these archives are in the NetCDF format (https://www.unidata.ucar.edu/software/netcdf/). NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data. The Unidata Program Center supports and maintains netCDF programming interfaces for C, C++, Java, and Fortran. Programming interfaces are also available for Python, IDL, MATLAB, R, Ruby, and Perl. Data in netCDF format is:

    • Self-Describing. A netCDF file includes information about the data it contains.
    • Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
    • Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
    • Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
    • Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
    • Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.

    Pub_figures.tar.zip contains the NCL scripts for figures 1-5 and the Chesapeake Bay Airshed shapefile. The directory structure of the archive is ./Pub_figures/Fig#_data, where # is the figure number from 1-5.

    EMISS.data.tar.zip contains two NetCDF files with the emission totals for the 2011ec and 2040ei emission inventories. The file names contain the year of the inventory, and the file header contains a description of each variable and the variable units.

    EPIC.data.tar.zip contains the monthly mean EPIC data in NetCDF format for ammonium fertilizer application (files with ANH3 in the name) and soil ammonium concentration (files with NH3 in the name) for the historical (Hist directory) and future (RCP-4.5 directory) simulations.

    WRF.data.tar.zip contains mean monthly and seasonal data from the 36 km downscaled WRF simulations in NetCDF format for the historical (Hist directory) and future (RCP-4.5 directory) simulations.

    CMAQ.data.tar.zip contains the mean monthly and seasonal data in NetCDF format from the 36 km CMAQ simulations for the historical (Hist directory), future (RCP-4.5 directory) and future with historical emissions (RCP-4.5-hist-emiss directory).

    This dataset is associated with the following publication: Campbell, P., J. Bash, C. Nolte, T. Spero, E. Cooter, K. Hinson, and L. Linker. Projections of Atmospheric Nitrogen Deposition to the Chesapeake Bay Watershed. Journal of Geophysical Research - Biogeosciences. American Geophysical Union, Washington, DC, USA, 12(11): 3307-3326, (2019).

  11. Trusted Research Environments: Analysis of Characteristics and Data...

    • researchdata.tuwien.ac.at
    bin, csv
    Updated Jun 25, 2024
    Cite
    Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data from being (accidentally) leaked outside the facility through technical, organizational, and legal measures. While many TREs exist in Europe, little information is publicly available on their architecture, descriptions of their building blocks, and their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of the sensitive data records included in this study.

    Methodology

    We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:

    • Peer-reviewed articles where available,
    • TRE websites,
    • TRE metadata catalogs.

    The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.

    Technical details

    This dataset consists of five comma-separated values (.csv) files describing our inventory:

    • countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
    • tres.csv: Table of TREs with columns id (number), name (text), countryid (number, referring to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonymized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records], optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
    • access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
    • inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
    • major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).

    Additionally, a MariaDB (10.5 or higher) schema definition .sql file models the schema for the database:

    • schema.sql: Schema definition file to create the tables and views used in the analysis.

    The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb

  12. Data Sheet 1_Exploring meaning in life from social network content in the...

    • figshare.com
    pdf
    Updated Nov 11, 2025
    Cite
    Qi Li; Mengyao Wang; Junjie Yan; Wu Jiake; Liang Zhao; Xin Wang; Bowen Yao; Lei Cao (2025). Data Sheet 1_Exploring meaning in life from social network content in the sleep scenario.pdf [Dataset]. http://doi.org/10.3389/fpubh.2025.1642085.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 11, 2025
    Dataset provided by
    Frontiers
    Authors
    Qi Li; Mengyao Wang; Junjie Yan; Wu Jiake; Liang Zhao; Xin Wang; Bowen Yao; Lei Cao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The exploration of life’s meaning has been a key topic across disciplines, and artificial intelligence is now beginning to investigate it.

    Methods: This study leveraged social media to assess meaning in life (MIL) and its associated factors at individual and group levels. We compiled a diverse dataset consisting of microblog posts (N = 7,588,597) and responses from user surveys (N = 448), annotated using a combination of self-assessment, expert opinions, and ChatGPT-generated insights. Our methodology examined MIL in three ways: (1) developing deep learning models to assess MIL components, (2) applying semantic dependency graph algorithms to identify MIL-associated factors, and (3) constructing eight subnetworks to analyze factors, their interrelations, and MIL differences.

    Results: We validated these methods and bridged two foundational MIL theories, highlighting their interconnections.

    Discussion: By identifying psychological risk factors, our work may provide clues to mental health issues and inform possible interventions.

  13. Data from: Metaphor annotations in Polish political debates from 2020 (TVP...

    • live.european-language-grid.eu
    binary format
    Updated Jun 30, 2021
    (2021). Metaphor annotations in Polish political debates from 2020 (TVP 2019-10-01 and TVN 2019-10-08) – presidential election [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8682
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jun 30, 2021
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The data published here are supplementary material for a paper to be published in Metaphor and Social Words (under revision).

    Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Procedure. We used eMargin, a collaborative textual annotation tool (Kehoe and Gee 2013), and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor-related word (MRW) if its “contextual meaning was related to the more basic meaning by some form of similarity” (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphorical expression words (MEW). At least two words are needed to identify an actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of a metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor-related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW remains basic and concrete, as it is unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: “Służba zdrowia jest w zapaści” (“Health service suffers from a collapse”), where the word “zapaść” (“collapse”) is an example of an MRW and the words “służba zdrowia” (“health service”) are labelled as MEW. The English translation of this expression needs a different verb: instead of “jest w zapaści” (“is in collapse”), the unmarked English collocation is “suffers from a collapse”, therefore the words “suffers from a collapse” are labelled as MRW.
The “collapse” could be caused by heart failure, such as cardiac arrest or any other life-threatening medical condition and “health service” is portrayed as if it could literally suffer from such a condition – a collapse.

    The data are in csv tables exported from xml files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, one per annotator. MRW words are marked as MLN, MEW words as MLP, and functional words within a metaphorical expression as MLI; all other words are marked noana, meaning no annotation needed.
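Tallying the per-token label distribution is a natural first step with these tables. A minimal sketch using Python's csv module; the column names ("token", "label") and the sample rows are hypothetical, since the exact export layout is not specified here:

```python
import csv
import io
from collections import Counter

# Hypothetical layout: the exported tables are assumed to have "token"
# and "label" columns; check the actual csv headers before relying on this.
sample_csv = """token,label
służba,MLP
zdrowia,MLP
jest,MLI
w,MLI
zapaści,MLN
dzisiaj,noana
"""

with io.StringIO(sample_csv) as f:
    counts = Counter(row["label"] for row in csv.DictReader(f))

print(counts)  # Counter({'MLP': 2, 'MLI': 2, 'MLN': 1, 'noana': 1})
```

For the real files, replace the `io.StringIO` wrapper with `open("part01.csv", encoding="utf-8")`.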

  14. NCEP Re-analysis Monthly Mean Data 2001-2004 for SBI Domain (Matlab) [NCEP]

    • data.ucar.edu
    • arcticdata.io
    • +1more
    matlab
    Updated Oct 7, 2025
    Kent Moore (2025). NCEP Re-analysis Monthly Mean Data 2001-2004 for SBI Domain (Matlab) [NCEP] [Dataset]. http://doi.org/10.5065/D6ZK5DR6
    Explore at:
    Available download formats: matlab
    Dataset updated
    Oct 7, 2025
    Authors
    Kent Moore
    Time period covered
    Jan 1, 2001 - Oct 31, 2004
    Area covered
    Description

    This data set contains National Centers for Environmental Prediction (NCEP) re-analysis monthly mean data 2001-2004 for the SBI domain in Matlab format.

  15. Data from: Digital analysis of cDNA abundance; expression profiling by means...

    • catalog.data.gov
    • data.virginia.gov
    • +1more
    Updated Sep 6, 2025
    + more versions
    National Institutes of Health (2025). Digital analysis of cDNA abundance; expression profiling by means of restriction fragment fingerprinting [Dataset]. https://catalog.data.gov/dataset/digital-analysis-of-cdna-abundance-expression-profiling-by-means-of-restriction-fragment-f
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background Gene expression profiling among different tissues is of paramount interest in various areas of biomedical research. We have developed a novel method (DADA, Digital Analysis of cDNA Abundance) that calculates the relative abundance of genes in cDNA libraries. Results DADA is based upon multiple restriction fragment length analysis of pools of clones from cDNA libraries and the identification of gene-specific restriction fingerprints in the resulting complex fragment mixtures. A specific cDNA cloning vector had to be constructed that governed missing or incomplete cDNA inserts which would generate misleading fingerprints in standard cloning vectors. Double stranded cDNA was synthesized using an anchored oligo dT primer, uni-directionally inserted into the DADA vector and cDNA libraries were constructed in E. coli. The cDNA fingerprints were generated in a PCR-free procedure that allows for parallel plasmid preparation, labeling, restriction digest and fragment separation of pools of 96 colonies each. This multiplexing significantly enhanced the throughput in comparison to sequence-based methods (e.g. the EST approach). The data of the fragment mixtures were integrated into a relational database system and queried with fingerprints experimentally produced by analyzing single colonies. Due to the limited predictability of the position of DNA fragments of a given size on the polyacrylamide gels, fingerprints derived solely from cDNA sequences were not accurate enough to be used for the analysis. We applied DADA to the analysis of gene expression profiles in a model for impaired wound healing (treatment of mice with dexamethasone). Conclusions The method proved to be capable of identifying pharmacologically relevant target genes that had not been identified by other standard methods routinely used to find differentially expressed genes.
Due to the above-mentioned limited predictability of the fingerprints, the method has so far been tested with only a limited number of experimentally determined fingerprints, and was able to detect differences in gene expression of transcripts representing 0.05% of the total mRNA population (e.g. medium-abundance gene transcripts).

  16. Genius Song Lyrics

    • kaggle.com
    zip
    Updated Jan 11, 2023
    CarlosGDCJ (2023). Genius Song Lyrics [Dataset]. https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/discussion
    Explore at:
    Available download formats: zip (3263274583 bytes)
    Dataset updated
    Jan 11, 2023
    Authors
    CarlosGDCJ
    Description

    This dataset contains information as recent as 2022, scraped from Genius, a site where people can upload and annotate songs, poems, and even books (but mostly songs). It builds upon the 5 Million Song Lyrics Dataset by using language-identification models to identify the language of each entry.

    Genius lyrics are written in a format that needs some preprocessing. Song metadata often appears between square brackets in the middle of the lyrics, and the overall line structure of the lyrics is preserved, meaning that each entry most likely contains many newline characters that can cause challenges when reading the data or passing it to a model. Other columns, like features, need similar care before being used. This dataset is an excellent choice if you want to work on NLP while also practicing your data-cleaning skills.
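Stripping the bracketed section markers and collapsing blank lines is the usual first cleaning step. A minimal sketch; the regex and the sample lyrics are illustrative, not taken from the dataset:

```python
import re

def clean_lyrics(raw: str) -> str:
    """Strip [Verse 1]-style section markers and collapse blank lines."""
    # Remove bracketed metadata such as [Chorus] or [Verse 2: Artist]
    no_meta = re.sub(r"\[[^\]]*\]", "", raw)
    # Drop empty lines left behind and rejoin the remaining lyric lines
    lines = [ln.strip() for ln in no_meta.splitlines()]
    return "\n".join(ln for ln in lines if ln)

sample = "[Verse 1]\nHello world\n\n[Chorus]\nLa la la\n"
print(clean_lyrics(sample))  # Hello world\nLa la la
```

When reading the CSV itself, remember that the lyrics field contains embedded newlines, so use a proper CSV parser (e.g. `csv` or pandas) rather than splitting on line breaks.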

    Feature overview

    | Column | Meaning |
    |:--------|:---------|
    | title | Title of the piece. Most entries are songs, but there are also some books, poems and even some other stuff |
    | tag | Genre of the piece. Most non-music pieces are "misc", but not all. Some songs are also labeled as "misc" |
    | artist | Person or group the piece is attributed to |
    | year | Release year |
    | views | Number of page views |
    | features | Other artists that contributed |
    | lyrics | Lyrics |
    | id | Genius identifier |
    | language_cld3 | Lyrics language according to CLD3. Unreliable results are NaN |
    | language_ft | Lyrics language according to FastText's langid. Values with low confidence (<0.5) are NaN |
    | language | Combines language_cld3 and language_ft. Only has a non-NaN entry if they both "agree" |

    Language index

    Below is a reference mapping language tags to their english names. I recommend using a library like langcodes to deal with this kind of stuff. | Tag | Language Name | |:------|:------------------| | af | Afrikaans | | als | Tosk Albanian | | am | Amharic | | ar | Arabic | | arz | Egyptian Arabic | | as | Assamese | | ast | Asturian | | az | Azerbaijani | | azb | South Azerbaijani | | ba | Bashkir | | be | Belarusian | | bg | Bulgarian | | bh | Bihari languages | | bn | Bangla | | bo | Tibetan | | br | Breton | | bs | Bosnian | | ca | Catalan | | ce | Chechen | | ceb | Cebuano | | ckb | Central Kurdish | | co | Corsican | | cs | Czech | | cv | Chuvash | | cy | Welsh | | da | Danish | | de | German | | dv | Divehi | | el | Greek | | en | English | | eo | Esperanto | | es | Spanish | | et | Estonian | | eu | Basque | | fa | Persian | | fi | Finnish | | fil | Filipino | | fr | French | | fy | Western Frisian | | ga | Irish | | gd | Scottish Gaelic | | gl | Galician | | gn | Guarani | | gu | Gujarati | | ha | Hausa | | haw | Hawaiian | | he | Hebrew | | hi | Hindi | | hmn | Hmong | | hr | Croatian | | hsb | Upper Sorbian | | ht | Haitian Creole | | hu | Hungarian | | hy | Armenian | | ia | Interlingua | | id | Indonesian | | ie | Interlingue | | ig | Igbo | | io | Ido | | is | Icelandic | | it | Italian | | ja | Japanese | | jbo | Lojban | | jv | Javanese | | ka | Georgian | | kk | Kazakh | | ...

  17. MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1more
    bin, tsv, zip
    Updated Oct 28, 2021
    Luis Gasco; Luis Gasco; Miranda Antonio; Miranda Antonio; Martin Krallinger; Martin Krallinger (2021). MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish [Dataset]. http://doi.org/10.5281/zenodo.4707104
    Explore at:
    Available download formats: zip, tsv, bin
    Dataset updated
    Oct 28, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luis Gasco; Luis Gasco; Miranda Antonio; Miranda Antonio; Martin Krallinger; Martin Krallinger
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track; see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled for September in Bucharest, Romania): http://clef2021.clef-initiative.eu/

    Introduction:
    These corpora contain the data for each of the subtracks of MESINESP2 shared-task:

    • [Subtrack 1] MESINESP-L – Scientific Literature :
      • Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. We filtered out empty and non-Spanish abstracts. We built the training dataset from data crawled on 01/29/2021; the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record's first inclusion in the database. We distribute two different datasets:
        • Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
        • Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
      • Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the Inter-Annotator Agreement among their annotations, we selected the 3 best ones and considered their annotations the valid ones for building the set. From those 1065 records:
        • 213 articles were annotated by more than one annotator. We selected the union of their annotations.
        • 852 articles were annotated by only one of the three selected best-performing annotators.
      • Test set: We provide a test set containing 10179 abstracts without DeCS codes (not annotated) from LILACS and IBECS. Participants will have to predict the DeCS codes for each of the abstracts in the entire dataset. However, the evaluation of the systems will only be made on a set of 500 expert-annotated abstracts that will be published as a Gold Standard after the evaluation period finishes.
    • [Subtrack 2] MESINESP-T- Clinical Trials:
      • Training set: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC doesn't provide documents with the title/abstract structure needed in BioASQ, so we built artificial abstracts based on the content crawled via the REEC API. Clinical trials are not indexed with DeCS terminology, so we used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we only selected predictions from runs with a MiF higher than 0.41, which corresponds to the submission of the best team.
      • Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
      • Test set: The test dataset contains a collection of 8919 items. Out of this subset, there are 461 clinical trials coming from REEC and 8458 clinical trials artificially constructed from drug datasheets that have a similar structure to REEC documents. The evaluation of the systems will be performed on a set of 250 items annotated by DeCS experts following the same protocol as in subtrack 1. Similarly, these items will be published as Gold Standard after completion of the task.
    • [Subtrack 3] MESINESP-P – Patents:
      • Development set: We provide a Development set manually indexed by expert annotators. This dataset includes 115 patents in Spanish extracted from Google Patents which have the IPC code “A61P” and “A61K31”. We have selected these patents based on semantic similarity to the MESINESP-L training set to facilitate model generation and to try to improve model performance.
      • Test set: We provide a test set containing 68404 records that correspond to the total number of patents published in Spanish with the IPC codes “A61P” and “A61K31”. From this set, 150 will be selected and indexed by DeCS experts under the protocol defined in subtask 1, which will be used to evaluate the quality of the developed systems. Similarly to the development set, we selected these 150 records based on semantic similarity to the MESINESP-L training set.
    • Additional data:
      • We provide this information to the participants as additional data in the “Additional Data” folder. For each training, development, and test set there is an additional JSON file with the structure shown here. Each file contains entities related to medications, diseases, symptoms, and medical procedures extracted with the BSC NERs.

    Files structure:

    Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:

    • Subtrack1:
      • Train
        • training_set_track1_all.json: Full training set for subtrack 1.
        • training_set_track1_only_articles.json: Articles training set for subtrack 1.
      • Development
        • development_set_subtrack1.json: Manually annotated development set for subtrack 1.
      • Test
        • test_set_subtrack1.json: Test set for subtrack 1.

    Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:

    • Subtrack2:
      • Train
        • training_set_subtrack2.json: Training set for subtrack 2.
      • Development
        • development_set_subtrack2.json: Manually annotated development set for subtrack 2.
      • Test
        • test_set_subtrack2.json: Test set for subtrack 2.

    Subtrack3-Patents.zip contains the corpora generated for subtrack 3. Content:

    • Subtrack3:
      • Development
        • development_set_subtrack3.json: Manually annotated development set for subtrack 3.
      • Test
        • test_set_subtrack3.json: Test set for subtrack 3.

    Additional data.zip contains the corpora with additional data for each subtrack of MESINESP2.

    DeCS2020.tsv contains a DeCS table with the following structure:

    • DeCS code
    • Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
    • List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes)
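Reading DeCS2020.tsv into a lookup table keyed by code might look like the sketch below; the sample row is invented, and the assumption that the file has no header row should be checked against the actual file:

```python
import csv
import io

# Invented sample row in the described three-column layout:
# DeCS code <TAB> preferred descriptor <TAB> pipe-separated synonyms.
sample_tsv = "12345\tAbdomen\tabdomen|vientre\n"

decs = {}
for code, preferred, synonyms in csv.reader(io.StringIO(sample_tsv), delimiter="\t"):
    decs[code] = {"preferred": preferred, "synonyms": synonyms.split("|")}

print(decs["12345"]["synonyms"])  # ['abdomen', 'vientre']
```

For the real file, swap the `io.StringIO` wrapper for `open("DeCS2020.tsv", encoding="utf-8")`.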

    DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.

    *Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.

    For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at lgasco@bsc.es

  18. 10 Important Questions on Fundamental Analysis of Stocks – Meaning,...

    • smartinvestello.com
    html
    Updated Oct 5, 2025
    Smart Investello (2025). 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide - Data Table [Dataset]. https://smartinvestello.com/10-questions-on-fundamental-analysis/
    Explore at:
    Available download formats: html
    Dataset updated
    Oct 5, 2025
    Dataset authored and provided by
    Smart Investello
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset extracted from the post 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide on Smart Investello.

  19. Mall Customer Segmentation

    • kaggle.com
    zip
    Updated Jan 3, 2022
    + more versions
    Nelakurthi Sudheer (2022). Mall Customer Segmentation [Dataset]. https://www.kaggle.com/datasets/nelakurthisudheer/mall-customer-segmentation/discussion
    Explore at:
    Available download formats: zip (1583 bytes)
    Dataset updated
    Jan 3, 2022
    Authors
    Nelakurthi Sudheer
    Description

    Context

    This data set was created only for learning customer segmentation concepts, also known as market basket analysis. I will demonstrate this using an unsupervised ML technique (the KMeans clustering algorithm) in its simplest form.

    Content

    You own a supermarket mall and, through membership cards, you have some basic data about your customers, such as Customer ID, age, gender, annual income, and spending score. Spending Score is something you assign to each customer based on your defined parameters, like customer behavior and purchasing data.

    Problem Statement: You own the mall and want to understand which customers can be easily targeted [Target Customers], so that the insights can be given to the marketing team to plan the strategy accordingly.

    Acknowledgements

    Have a look at the complete implementation of customer segmentation at the link below: https://github.com/NelakurthiSudheer/Mall-Customers-Segmentation

    Inspiration

    By the end of this case study, you will be able to answer the questions below. 1. How to achieve customer segmentation using a machine learning algorithm (KMeans clustering) in Python in the simplest way. 2. Who are your target customers with whom you can start a marketing strategy [easy to convert]. 3. How the marketing strategy works in the real world.
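To make the clustering idea concrete, here is a minimal pure-Python sketch of Lloyd's k-means on invented (income, spending) pairs; a real analysis would load the dataset's CSV and use a library such as scikit-learn instead:

```python
# A pure-Python sketch of Lloyd's k-means. The points, k=2, and the
# initial centroids are invented for illustration only.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(coords) / len(coords) for coords in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two obvious groups: low income/low spending vs high income/high spending
pts = [(15, 20), (16, 22), (18, 19), (80, 85), (82, 90), (85, 88)]
centroids, clusters = kmeans(pts, centroids=[(0, 0), (100, 100)])
print([len(c) for c in clusters])  # [3, 3]
```

The assignment and update steps are exactly what `sklearn.cluster.KMeans` iterates internally; on the real data you would typically cluster the annual income and spending score columns and pick k with the elbow method.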

  20. Swedish Speech Acts

    • kaggle.com
    zip
    Updated Jun 3, 2024
    Daniel Tufvesson (2024). Swedish Speech Acts [Dataset]. https://www.kaggle.com/datasets/danieltufvesson/swedics-speech-acts
    Explore at:
    Available download formats: zip (507722308 bytes)
    Dataset updated
    Jun 3, 2024
    Authors
    Daniel Tufvesson
    Description

    What are Speech Acts?

    What is done through speaking? In a sense, a spoken utterance is just a string of vocal sounds. But in another sense, it is also a social action that has real effects on the world. For example, "Can you pass the salt?" is an act of requesting the salt, which can then result in obtaining the salt. These spoken actions are referred to as speech acts. We humans unconsciously understand and categorize speech acts all the time. The meaning of a speech act depends both on the syntax and semantics of the sentence and the conversational context in which it occurs.

    What is in these Data Sets?

    The data sets consist of isolated Swedish sentences originating from online discussion forums (familjeliv.se and flashback.se). I have hand-labeled these with their respective speech acts.

    What Speech Acts are Annotated?

    The sentences are annotated with the following speech acts, which are taken from The Swedish Academy Grammar (Teleman et al., 1999):

    • Assertive: the speaker holds that the content of the sentence is true or at least true to a varying degree. For example: “They launched a car into space.”

    • Question: the speaker requests information regarding whether or not something is true, or under what conditions it is true. For example: “Are you busy?” or “How much does the car cost?”.

    • Directive: the speaker attempts to get the listener to carry out the action described by the sentence. For example: “Open the door!” or “Will you hold this for me?”

    • Expressive: the speaker expresses some feeling or emotional attitude about the content of the sentence. For example: “What an adorable dog!” or “The Avengers are awesome!”

    Why do these Exist?

    I created these data sets for training and evaluating two machine learning classifiers. These are available on GitHub.

    Data Files

    These are all CoNLL-U corpora, consisting of sentences manually annotated with speech acts. The sentences were also automatically annotated with a sentiment (positive, negative, or neutral) and its probability score.

    • all-data.conllu.bz2 - All annotated sentences.

    • dev-set.conllu.bz2 - The dev (or validation) set. Split from all-data.conllu.bz2.

    • dev-test-set.conllu.bz2 - A test split of the dev set.

    • dev-test-set-upsampled.conllu.bz2 - An upsampled version of dev-test-set.

    • dev-train-set.conllu.bz2 - A train split of the dev set.

    • dev-train-set-upsampled.conllu.bz2 - An upsampled version of dev-train-set.

    • test-set.conllu.bz2 - The test set used for evaluation. Split from all-data.conllu.bz2.

    • test-set-upsampled.conllu.bz2 - An upsampled version of the test-set.

    • train-set.conllu.bz2 - A train set. This was automatically annotated by a rule-based classifier.

    CoNLL-U Format

    The corpora are formatted as CoNLL-U. In addition to the standard CoNLL-U annotations (Universal Dependencies, n.d.-a), I have added the following attributes as sentence comments to each sentence:

    • sent_id: a unique identifying integer. This is unique across all the data sets.
    • text: the full, unsegmented sentence.
    • date: the date and time at which the sentence was posted on the internet forum.
    • url: the URL where the sentence was posted.
    • genre: the text genre of the sentence. This is technically superfluous, since all the sentences are of the same genre, namely internet_forum.
    • x_sent_id: the ID of the sentence in the original corpus.
    • speech_act: the annotated speech act of the sentence, whether automatically or manually annotated. The possible values are assertion, question, directive, and expressive.
    • sentiment_label: the label denoting the sentiment of the sentence, automatically tagged by the sentiment tagger. The labels are positive, neutral, or negative.
    • sentiment_score: the estimated probability of the sentiment label. As with the sentiment label, this was also produced by the sentiment tagger.

    # sent_id = 2200888
    # text = Känns hoppfull med så många exempel.
    # date = 2009-10-26 16:19:10
    # url = http://www.familjeliv.se/forum/thread/48269320-bara-solsken-och-hopp/1#anchor-m3
    # genre = internet_forum
    # x_sent_id = 053044fa6
    # speech_act = expressive
    # sentiment_label = positive
    # sentiment_score = 0.9705862402915955
    1  Känns    känna|kännas  VERB  VB  Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass  0  root  _  _
    2  hoppfull  hoppfull    ADJ   JJ  Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing  1  xcomp  _  _
    3  med     med       ADP   PP  _  6  case  _  _
    4  så     så       ADVERB AB  _  5  advmod  _  _
    5  många    _        ADJ   JJ  Case=Nom|Definite=Def,Ind|Degree=Pos|Gender=Com,Neut|Num...
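The sentence-level comment attributes can be pulled out with a few lines of standard-library Python (real projects might prefer a dedicated CoNLL-U parsing package); the sample string below is abbreviated from the example above:

```python
def parse_sentence_comments(conllu_text):
    """Collect '# key = value' comments for each sentence block."""
    sentences, current = [], {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if line.startswith("#") and "=" in line:
            key, _, value = line.lstrip("# ").partition("=")
            current[key.strip()] = value.strip()
        elif not line and current:  # blank line ends a sentence block
            sentences.append(current)
            current = {}
    if current:  # flush the last block if the file doesn't end blank
        sentences.append(current)
    return sentences

sample = """# sent_id = 2200888
# speech_act = expressive
# sentiment_label = positive
1\tKänns\tkänna\tVERB\t_\t_\t0\troot\t_\t_
"""
print(parse_sentence_comments(sample)[0]["speech_act"])  # expressive
```

Decompress the .bz2 files with Python's `bz2.open(path, "rt")` before feeding their text to a parser like this.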
    
Data Seeds, 330K+ Interior Design Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/200k-interior-design-images-ai-training-data-annotated-i-data-seeds

330K+ Interior Design Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage

Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
Dataset authored and provided by
Data Seeds
Area covered
Indonesia, Curaçao, Ethiopia, Jamaica, Egypt, Congo, Tajikistan, Turks and Caicos Islands, Kuwait, Nicaragua
Description

This dataset features over 330,000 high-quality interior design images sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly varied and extensively annotated collection of indoor environment visuals.

Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, making it ideal for tasks such as room classification, furniture detection, and spatial layout analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.

  2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions centered on interior design themes ensure a steady stream of fresh, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours to fulfill specific requests, such as particular room types, design styles, or furnishings.

  3. Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide spectrum of architectural styles, cultural aesthetics, and functional spaces. The images include homes, offices, restaurants, studios, and public interiors—ranging from minimalist and modern to classic and eclectic designs.

  4. High-Quality Imagery: the dataset includes standard to ultra-high-definition images that capture fine interior details. Both professionally staged and candid real-life spaces are included, offering versatility for training AI across design evaluation, object detection, and environmental understanding.

  5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This provides valuable insights into global aesthetic trends, helping AI models learn user preferences, design appeal, and stylistic relevance.

  6. AI-Ready Design: the dataset is optimized for machine learning tasks such as interior scene recognition, style transfer, virtual staging, and layout generation. It integrates smoothly with popular AI development environments and tools.

  7. Licensing & Compliance: the dataset fully complies with data privacy regulations and includes transparent licensing suitable for commercial and academic use.

Use Cases: 1. Training AI for interior design recommendation engines and virtual staging tools. 2. Enhancing smart home applications and spatial recognition systems. 3. Powering AR/VR platforms for virtual tours, furniture placement, and room redesign. 4. Supporting architectural visualization, decor style transfer, and real estate marketing.

This dataset offers a comprehensive, high-quality resource tailored for AI-driven innovation in design, real estate, and spatial computing. Customizations are available upon request. Contact us to learn more!
