24 datasets found
  1. Generated E-mail Spam by LLM

    • kaggle.com
    zip
    Updated Sep 20, 2023
    Cite
    Unique Data (2023). Generated E-mail Spam by LLM [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/generated-e-mail-spam
    Explore at:
    Available download formats: zip (78131 bytes)
    Dataset updated
    Sep 20, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) - https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Generated E-mail Spam - text classification dataset

    The dataset consists of a CSV file containing 300 generated spam email messages. Each row in the file represents a separate email message with its title and text. The dataset aims to facilitate the analysis and detection of spam emails.

    💴 For commercial usage: to discuss your requirements, learn about pricing, and purchase the dataset, leave a request on our website.

    The dataset can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.

    Generated Data

    The data was generated with the OpenAI API using the text-davinci-003 model.


    🧩 This is just an example of the data. Leave a request here to learn more

    Content

    The dataset is a single .csv file (UTF-8 encoded) that includes the following columns (a minimal loading sketch follows the list):

    • title: title of the email
    • text: text of the email
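    These two columns are enough to prototype a text-classification pipeline. Below is a minimal sketch, assuming pandas and scikit-learn and a hypothetical local filename; note that the file contains only spam, so a separate ham corpus would be needed for supervised training.

```python
# Minimal loading sketch for the generated spam CSV described above.
# "generated_email_spam.csv" is a hypothetical filename; the columns
# "title" and "text" follow the dataset description.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("generated_email_spam.csv", encoding="utf-8")

# Combine title and text into one field before vectorising.
corpus = df["title"].fillna("") + " " + df["text"].fillna("")
X = TfidfVectorizer(max_features=5000).fit_transform(corpus)

print(X.shape)  # expected: (300, n_features) if the file has 300 rows as described
```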

    Email spam might be generated in accordance with your requirements.

    🚀 You can learn more about our high-quality unique datasets here

    keywords: spam mails dataset, email spam classification, spam or not-spam, spam e-mail database, spam detection system, email spamming data set, spam filtering system, spambase, feature extraction, spam ham email dataset, classifier, machine learning algorithms, automated, generated data, synthetic data, synthetic data generation, synthetic dataset, cybersecurity, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data

  2. Company Shareholder Email Data

    • listtodata.com
    • my.listtodata.com
    • +1more
    .csv, .xls, .txt
    Updated Jul 17, 2025
    Cite
    List to Data (2025). Company Shareholder Email Data [Dataset]. https://listtodata.com/company-shareholder-email-list
    Explore at:
    Available download formats: .csv, .xls, .txt
    Dataset updated
    Jul 17, 2025
    Authors
    List to Data
    License

    CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2025 - Dec 31, 2025
    Area covered
    Switzerland, New Zealand, Romania, Uganda, Myanmar, France, Belize, Slovakia, Nigeria, Saint Helena
    Variables measured
    Phone numbers, email address, full name, address, city, state, gender, age, income, IP address
    Description

    A company shareholder email list is a database of contact information for people who own shares in a company. It is a key tool for investor relations and often a legal requirement for public companies. It allows a company to send important updates directly to shareholders, including annual reports, news releases, and announcements about meetings or dividends. The list helps a company build trust and maintain a strong relationship with its investors, keeping shareholders engaged and informed about the company's performance and future. For example, when a company has news, such as its latest earnings, it can use the shareholder list to send the message directly to shareholders.

    A company shareholder email list also makes messages more relevant and effective. A company can use this data to send personalised messages, for example tailored to how many shares a person owns. Public companies have legal requirements to communicate with their shareholders, and this list helps them meet those obligations easily and efficiently. In short, this resource is vital: it helps you manage investor relationships, ensures compliance, and keeps shareholders engaged. List to Data is a useful tool for growing your business. The company shareholder email database is a more focused version of the shareholder dataset, concentrating on the most active and engaged shareholders: people who attend meetings, vote on important decisions, or talk to the company's leaders. This also makes it easier for companies to send the right messages to the right people; for example, if a company knows where shareholders live, it can send them more relevant messages, which improves communication for everyone.

    The company shareholder email database is helpful because it improves communication with shareholders. For example, a company can use it to share news, tell shareholders about a new product, or ask them to vote on important matters.

  3. Student Information Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2024
    Cite
    Zeeshan Ahmad (2024). Student Information Dataset [Dataset]. https://www.kaggle.com/datasets/zeeshier/student-information-dataset/code
    Explore at:
    Available download formats: zip (5606 bytes)
    Dataset updated
    Nov 16, 2024
    Authors
    Zeeshan Ahmad
    License

    Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains information about 200 students with 7 attributes. It is designed for use in exploratory data analysis, SQL practice, or machine learning experiments related to student performance or demographics.

    StudentID: Unique identifier for each student. Type: Integer (4-digit unique number). Example: 1023, 3045.

    Name: Full name of the student. Type: String (Text). Example: John Doe, Alice Smith.

    Age: Age of the student. Type: Integer (Range: 18–25). Example: 21, 19.

    Email: Email address of the student. Type: String (Unique). Example: john.doe@example.com, alice.smith@example.org.

    Department: Department or major of the student. Type: String (Categorical).

    Categories: Computer Science, Biology, Mathematics, Physics, Chemistry. Example: Biology, Computer Science.

    GPA: Grade Point Average (GPA) of the student (on a 4.0 scale). Type: Float (2 decimal points). Range: 2.0–4.0. Example: 3.45, 3.89.

    GraduationYear: Expected graduation year of the student. Type: Integer. Range: 2024–2030. Example: 2025, 2028.

    Data Characteristics Rows: 200 (One for each student). Columns: 7 (Attributes described above).

    Unique Values: Each StudentID and Email is unique. The Department column contains 5 distinct values.
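    As a quick illustration of the documented schema, the sketch below loads the file and checks the stated constraints (unique StudentID/Email, GPA range, five departments). The filename is hypothetical; the column names follow the description above.

```python
# Sanity-check sketch for the student dataset schema described above.
import pandas as pd

df = pd.read_csv("student_information.csv")  # hypothetical filename

assert df["StudentID"].is_unique and df["Email"].is_unique   # documented uniqueness
assert df["GPA"].between(2.0, 4.0).all()                     # documented GPA range
assert df["Department"].nunique() == 5                       # five distinct departments

print(df.groupby("Department")["GPA"].mean().round(2))       # mean GPA per department
```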

  4. Management, Organization and Innovation Survey 2009 - Serbia

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Sep 26, 2013
    + more versions
    Cite
    European Bank for Reconstruction and Development (2013). Management, Organization and Innovation Survey 2009 - Serbia [Dataset]. https://microdata.worldbank.org/index.php/catalog/317
    Explore at:
    Dataset updated
    Sep 26, 2013
    Dataset provided by
    World Bank Group - http://www.worldbank.org/
    European Bank for Reconstruction and Development
    Time period covered
    2008 - 2009
    Area covered
    Serbia
    Description

    Abstract

    The study was conducted in Serbia between October 2008 and February 2009 as part of the first round of The Management, Organization and Innovation Survey. Data from 135 manufacturing companies with 50 to 5,000 full-time employees was analyzed.

    The survey topics include detailed information about a company and its management practices - production performance indicators, production targets, and how employees are promoted or dealt with when underperforming. The study also focuses on organizational matters, innovation, spending on research and development, production outsourcing to other countries, competition, and workforce composition.

    Analysis unit

    The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey, an establishment is defined as a separate production unit, regardless of whether or not it has its own financial statements separate from those of the firm, and whether it has its own management and control over payroll. So the bottling plant of a brewery would be counted as an establishment.

    Universe

    The survey universe was defined as manufacturing establishments with at least fifty, but less than 5,000, full-time employees.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Random sampling was used in the study. For all MOI countries, except Russia, there was a requirement that all regions must be covered and that the percentage of the sample in each region was required to be equal to at least one half of the percentage of the sample frame population in each region.

    In most countries the sample frame used was an extract from the Orbis database of Bureau van Dijk, which was provided to the Consultant by the EBRD. The sample frame contained details of company names, location, company size (number of employees), company performance measures and contact details. The sample frame downloaded from Orbis was cleaned by the EBRD through the addition of regional variables, updating addresses and phone numbers of companies.

    Examination of the Orbis sample frames showed their geographic distributions to be wide with many locations, a large number of which had only a small number of records. Each establishment was selected with two substitutes that can be used if it proves impossible to conduct an interview at the first establishment. In practice selection was confined to locations with the most records in the sample frame, so the sample frame was filtered to just the cities with the most establishments.

    The quality of the frame was assessed at the onset of the project. The frame proved to be useful though it showed positive rates of non-eligibility, repetition, non-existent units, etc. These problems are typical of establishment surveys. For Serbia, the percentage of confirmed non-eligible units as a proportion of the total number of contacts to complete the survey was 26.7% (82 out of 307 establishments).

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Two different versions of the questionnaire were used. Questionnaire A was used when interviewing establishments that are part of multi-establishment firms, while Questionnaire B was used when interviewing single-establishment firms. Questionnaire A incorporates all questions from Questionnaire B; the only difference is the reference point, which is the so-called national firm in the first part of Questionnaire A and the firm itself in Questionnaire B. The second part of the questionnaire refers to the interviewed establishment only in both versions. Each variation of the questionnaire is identified by the index variable, a0.

    Response rate

    Item non-response was addressed by two strategies: - For sensitive questions that may generate negative reactions from the respondent, such as ownership information, enumerators were instructed to collect the refusal to respond as (-8). - Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
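    For analysts working with the microdata, the (-8) refusal code described above typically needs to be recoded as missing before analysis. A minimal sketch, assuming pandas and purely hypothetical file and variable names (not the actual MOI variable names):

```python
# Recode the (-8) "refused to answer" value as missing before analysis.
import numpy as np
import pandas as pd

moi = pd.read_stata("moi_serbia_2009.dta")  # hypothetical filename and format

sensitive_cols = ["ownership_share", "rnd_spending"]  # hypothetical column names
moi[sensitive_cols] = moi[sensitive_cols].replace(-8, np.nan)

# Item non-response rate per sensitive question.
print(moi[sensitive_cols].isna().mean())
```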

    Survey non-response was addressed by maximising efforts to contact establishments that were initially selected for interviews. Up to 15 attempts (but at least 4) were made to contact an establishment for interview at different times and days of the week before a replacement establishment with similar characteristics was suggested for interview. Survey non-response did occur, but substitutions were made in order to meet the survey's targets.

    Additional information about sampling, response rates and survey implementation can be found in "MOI Survey Report on Methodology and Observations 2009" in "Technical Documents" folder.

  5. Chinese Language Datasets | 583K Translations | 141K Words | NLP | Dictionary...

    • datarade.ai
    Updated Aug 30, 2025
    Cite
    Oxford Languages (2025). Chinese Language Datasets | 583KTranslations | 141K Words | NLP | Dictionary Display | Translations Data | APAC coverage | Mandarin | Cantonese [Dataset]. https://datarade.ai/data-products/chinese-language-datasets-583ktranslations-178k-words-n-oxford-languages
    Explore at:
    Available download formats: .json, .xml, .csv, .txt
    Dataset updated
    Aug 30, 2025
    Dataset authored and provided by
    Oxford Languages - https://lexico.com/es
    Area covered
    Macao, Malaysia, Indonesia, China, Taiwan, Hong Kong, Singapore
    Description

    Comprehensive Chinese language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details. Covering Simplified and Traditional writing systems.

    Our Chinese language datasets are carefully compiled and annotated by language and linguistic experts. The below datasets are available for license:

    1. Mandarin Chinese (simplified) Monolingual Dictionary Data
    2. Mandarin Chinese (traditional) Monolingual Dictionary Data
    3. Mandarin Chinese (simplified) Bilingual Dictionary Data
    4. Mandarin Chinese (traditional) Bilingual Dictionary Data
    5. Mandarin Chinese (simplified) Synonyms and Antonyms Data

    Key Features (approximate numbers):

    1. Mandarin Chinese (simplified) Monolingual Dictionary Data

    Our Mandarin Chinese (simplified) monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language as spoken today.

    • Words: 81,300
    • Senses: 62,400
    • Example sentences: 80,700
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    2. Mandarin Chinese (traditional) Monolingual Dictionary Data

    Our Mandarin Chinese (traditional) monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language as spoken today.

    • Words: 60,100
    • Senses: 144,700
    • Example sentences: 29,900
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    3. Mandarin Chinese (simplified) Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Mandarin Chinese (simplified) and from Mandarin Chinese (simplified) to English. It is annually reviewed and updated by our in-house team of language experts. It offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

    • Translations: 367,600
    • Senses: 204,500
    • Example translations: 150,900
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually
    4. Mandarin Chinese (traditional) Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Mandarin Chinese (traditional) and from Mandarin Chinese (traditional) to English. It is annually reviewed and updated by our in-house team of language experts. It offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

    • Translations: 215,600
    • Senses: 202,800
    • Example sentences: 149,700
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    5. Mandarin Chinese (simplified) Synonyms and Antonyms Data

    The Mandarin Chinese (simplified) Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary Mandarin Chinese. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

    • Synonyms: 3,800
    • Antonyms: 3,180
    • Format: XML format
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

    Please note that some datasets may have rights restrictions. Contact us for more information.

    About the sample:

    The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our datasets, we provide a sample in CSV format for preview purposes only.

    If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.

  6. Data from: Experience collecting interim data on mortality: an example from...

    • catalog.data.gov
    • odgavaprod.ogopendata.com
    • +2more
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). Experience collecting interim data on mortality: an example from the RALES study [Dataset]. https://catalog.data.gov/dataset/experience-collecting-interim-data-on-mortality-an-example-from-the-rales-study
    Explore at:
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Introduction: The Randomized Aldactone Evaluation Study (RALES) randomized 822 patients to receive 25 mg spironolactone daily and 841 to receive placebo. The primary endpoint was death from all causes. Randomization began on March 24, 1995; recruitment was completed on December 31, 1996; follow-up was scheduled to continue through December 31, 1999. Evidence of a sizeable benefit on mortality emerged early in the RALES. The RALES data safety monitoring board (DSMB), which met semiannually throughout the trial, used a prespecified statistical guideline to recommend stopping for efficacy. At the DSMB's request, its meetings were preceded by an 'endpoint sweep', that is, a census of all participants to confirm their vital status.

    Methods: We used computer simulation to evaluate the effect of the sweeps.

    Results: The sweeps led to an estimated 5 to 8% increase in the number of reported deaths at the fourth and fifth interim analyses. The data crossed the statistical boundary at the fifth interim analysis. If investigators had reported all deaths within the protocol-required 24-h window, the DSMB might have recommended stopping after the fourth interim analysis.

    Discussion: Although endpoint sweeps can cause practical problems at the clinical centers, sweeps are very useful if the intervals between patient visits or contact are long or if endpoints require adjudication by committee, reading center, or central laboratory.

    Conclusion: We recommend that trials with interim analyses institute active reporting of the primary endpoints and endpoint sweeps.

  7. ZeroCostDL4Mic - YoloV2 example training and test dataset

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jul 14, 2020
    Cite
    Guillaume Jacquemet; Lucas von Chamier (2020). ZeroCostDL4Mic - YoloV2 example training and test dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3941907
    Explore at:
    Dataset updated
    Jul 14, 2020
    Dataset provided by
    MRC-Laboratory for Molecular Cell Biology. University College London, London, UK
    Åbo Akademi University, Turku, Finland
    Authors
    Guillaume Jacquemet; Lucas von Chamier
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Name: ZeroCostDL4Mic - YoloV2 example training and test dataset

    (see our Wiki for details)

    Data type: 2D grayscale .png images with corresponding bounding box annotations in PASCAL VOC .xml format.

    Microscopy data type: Phase contrast microscopy data (brightfield)

    Microscope: Inverted Zeiss Axio zoom widefield microscope equipped with an AxioCam MRm camera, an EL Plan-Neofluar 20 × /0.5 NA objective (Carl Zeiss), with a heated chamber (37 °C) and a CO2 controller (5%).

    Cell type: MDA-MB-231 cells migrating on cell-derived matrices generated by fibroblasts.

    File format: .png (8-bit)

    Image size: 1388 x 1040 px (pixel size: 323 nm)

    Author(s): Guillaume Jacquemet1,2,3, Lucas von Chamier4,5

    Contact email: lucas.chamier.13@ucl.ac.uk and guillaume.jacquemet@abo.fi

    Affiliation(s):

    1) Faculty of Science and Engineering, Cell Biology, Åbo Akademi University, 20520 Turku, Finland

    2) Turku Bioscience Centre, University of Turku and Åbo Akademi University, FI-20520 Turku

    3) ORCID: 0000-0002-9286-920X

    4) MRC-Laboratory for Molecular Cell Biology. University College London, London, UK

    5) ORCID: 0000-0002-9243-912X

    Associated publications: Jacquemet et al 2016. DOI: 10.1038/ncomms13297

    Funding bodies: G.J. was supported by grants awarded by the Academy of Finland, the Sigrid Juselius Foundation and Åbo Akademi University Research Foundation (CoE CellMech) and by Drug Discovery and Diagnostics strategic funding to Åbo Akademi University.
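    Since the bounding boxes are provided as PASCAL VOC .xml files (see the data type above), a minimal parsing sketch using only the Python standard library might look like this; the annotation filename is hypothetical.

```python
# Read one PASCAL VOC .xml annotation accompanying a .png image.
import xml.etree.ElementTree as ET

root = ET.parse("cell_0001.xml").getroot()  # hypothetical annotation file
for obj in root.iter("object"):
    label = obj.findtext("name")
    box = obj.find("bndbox")
    xmin, ymin, xmax, ymax = (int(float(box.findtext(tag)))
                              for tag in ("xmin", "ymin", "xmax", "ymax"))
    print(label, (xmin, ymin, xmax, ymax))
```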

  8. Data from: Example interview questions.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Oct 23, 2024
    + more versions
    Cite
    Baker, Graham; Lloyd, Scott; Sivaramakrishnan, Divya; Jepson, Ruth; Manner, Jillian (2024). Example interview questions. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001354087
    Explore at:
    Dataset updated
    Oct 23, 2024
    Authors
    Baker, Graham; Lloyd, Scott; Sivaramakrishnan, Divya; Jepson, Ruth; Manner, Jillian
    Description

    Background: The COVID-19 pandemic changed the way many industries work, including contact centres, with some employees working from home and new on-site restrictions/measures in place, representing even greater challenges for employers around staff engagement and wellbeing. This study aimed to understand the interplay of individual, social, environmental and organisational factors influencing physical activity and sedentary behaviour in UK contact centre employees, how the pandemic impacted these factors, and their relevance for the future of hybrid working.

    Methods: Individual interviews (n = 33) were conducted with participants (staff working full and part time, on site and from home) from four UK contact centres. A topic guide based on the ecological model was developed to understand current barriers and facilitators to physical activity and (reducing) sedentary behaviour during and outside of working hours. Thematic analysis was carried out using a codebook and a deductive coding approach to identify themes.

    Results: Three key insights are provided. First, participants felt they were generally sitting more and moving less since the first UK-wide lockdown. Second, factors which negatively impacted on these behaviours were evident across all levels of the ecological model. These included individual and social barriers (e.g., lack of motivation and preferable physical activity options) as well as environmental and organisational barriers (e.g., poor home office setup, back-to-back virtual meetings). There were a mix of new and existing barriers (exacerbated by the pandemic), and several of these were linked to homeworking. Third, organisational support requirements (e.g., homeworking ergonomic support) and existing facilitators (such as the provision of informational support and flexible working arrangements) were identified.

    Conclusion: Solutions to reduce sedentary behaviours and increase physical activity in contact centres need to address barriers from the individual to the organisational level. Whilst the study was undertaken in the UK, the results are likely to be applicable globally.

    Trial registration: The trial for the wider project has been registered on the ISRCTN database: http://www.isrctn.com/ISRCTN11580369.

  9. Management, Organization and Innovation Survey 2009 - Belarus

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Sep 26, 2013
    + more versions
    Cite
    European Bank for Reconstruction and Development (2013). Management, Organization and Innovation Survey 2009 - Belarus [Dataset]. https://microdata.worldbank.org/index.php/catalog/138
    Explore at:
    Dataset updated
    Sep 26, 2013
    Dataset provided by
    World Bank Group - http://www.worldbank.org/
    European Bank for Reconstruction and Development
    Time period covered
    2008 - 2009
    Area covered
    Belarus
    Description

    Abstract

    The study was conducted in Belarus between October 2008 and February 2009 as part of the first round of The Management, Organization and Innovation Survey. Data from 102 manufacturing companies with 50 to 5,000 full-time employees was analyzed.

    The survey topics include detailed information about a company and its management practices - production performance indicators, production targets, and how employees are promoted or dealt with when underperforming. The study also focuses on organizational matters, innovation, spending on research and development, production outsourcing to other countries, competition, and workforce composition.

    Geographic coverage

    National

    Analysis unit

    The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey, an establishment is defined as a separate production unit, regardless of whether or not it has its own financial statements separate from those of the firm, and whether it has its own management and control over payroll. So the bottling plant of a brewery would be counted as an establishment.

    Universe

    The survey universe was defined as manufacturing establishments with at least fifty, but less than 5,000, full-time employees.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Random sampling was used in the study. For all MOI countries, except Russia, there was a requirement that all regions must be covered and that the percentage of the sample in each region was required to be equal to at least one half of the percentage of the sample frame population in each region.

    In most countries the sample frame used was an extract from the Orbis database of Bureau van Dijk, which was provided to the Consultant by the EBRD. The sample frame contained details of company names, location, company size (number of employees), company performance measures and contact details. The sample frame downloaded from Orbis was cleaned by the EBRD through the addition of regional variables, updating addresses and phone numbers of companies.

    Examination of the Orbis sample frames showed their geographic distributions to be wide with many locations, a large number of which had only a small number of records. Each establishment was selected with two substitutes that can be used if it proves impossible to conduct an interview at the first establishment. In practice selection was confined to locations with the most records in the sample frame, so the sample frame was filtered to just the cities with the most establishments.

    The quality of the frame was assessed at the onset of the project. The frame proved to be useful though it showed positive rates of non-eligibility, repetition, non-existent units, etc. These problems are typical of establishment surveys. For Belarus, the percentage of confirmed non-eligible units as a proportion of the total number of contacts to complete the survey was 30.6% (83 out of 271 establishments).

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Two different versions of the questionnaire were used. Questionnaire A was used when interviewing establishments that are part of multi-establishment firms, while Questionnaire B was used when interviewing single-establishment firms. Questionnaire A incorporates all questions from Questionnaire B; the only difference is the reference point, which is the so-called national firm in the first part of Questionnaire A and the firm itself in Questionnaire B. The second part of the questionnaire refers to the interviewed establishment only in both versions. Each variation of the questionnaire is identified by the index variable, a0.

    Response rate

    Item non-response was addressed by two strategies: - For sensitive questions that may generate negative reactions from the respondent, such as ownership information, enumerators were instructed to collect the refusal to respond as (-8). - Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.

    Survey non-response was addressed by maximising efforts to contact establishments that were initially selected for interviews. Up to 15 attempts (but at least 4) were made to contact an establishment for interview at different times and days of the week before a replacement establishment with similar characteristics was suggested for interview. Survey non-response did occur, but substitutions were made in order to meet the survey's targets.

    Additional information about sampling, response rates and survey implementation can be found in "MOI Survey Report on Methodology and Observations 2009" in "Technical Documents" folder.

  10. The example data of Alpha-LFM

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jul 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zhang, Meng; Zhu, Lanxin; Sun, Jiahao; Yi, Chengqiang (2025). The example data of Alpha-LFM [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002104115
    Explore at:
    Dataset updated
    Jul 8, 2025
    Authors
    Zhang, Meng; Zhu, Lanxin; Sun, Jiahao; Yi, Chengqiang
    Description

    Dataset Overview:

    1. First volume of Reconstructed Sequences (located in subfolder "Representative Figures Data")
    Data type: 3D reconstruction (first volume) of mitochondria–lysosome interaction
    File format: .tif (8-bit)
    Image size: 1890x1380x161 px³ for mitochondria and lysosome, 1500x1830x101 px³ for peroxisome (properties: 0.054x0.054x0.054 µm³)

    2. Example Training Data for the Alpha-LFM Net (located in subfolder "Example_Training_Data")
    Data type: Lysosome training images (~2000 pairs) used for Alpha-Net training in light-field microscopy
    File format: .tif (8-bit)
    Image size: 480x480 px² for LF projections (properties: 0.108x0.108 µm²); 480x480x161 px³ for 3D volumes (properties: 0.054x0.054x0.054 µm³)

    Authors: Lanxin Zhu, Jiahao Sun, Chengqiang Yi, Meng Zhang
    Contact email: lanxinzhu@hust.edu.cn
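    The volumes are plain 8-bit .tif stacks, so they can be inspected with standard tools. A minimal sketch, assuming the tifffile package and a hypothetical path inside the subfolders named above:

```python
# Inspect one of the 8-bit .tif volumes described above.
import tifffile

vol = tifffile.imread("Example_Training_Data/volume_0001.tif")  # hypothetical path
print(vol.shape, vol.dtype)  # e.g. (161, 480, 480) uint8 for a 3D training volume
```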

  11. EMEA Data Suite | 3.3M Translations | 1.9M Words | 22 Languages | Natural...

    • datarade.ai
    Updated Aug 8, 2025
    Cite
    Oxford Languages (2025). EMEA Data Suite | 3.3M Translations | 1.9M Words | 22 Languages | Natural Language Processing (NLP) Data | Translation Data | TTS | EMEA Coverage [Dataset]. https://datarade.ai/data-products/emea-data-suite-3-3m-translations-1-9m-words-23-languag-oxford-languages
    Explore at:
    Available download formats: .json, .xml, .csv, .xls, .txt, .mp3, .wav
    Dataset updated
    Aug 8, 2025
    Dataset authored and provided by
    Oxford Languages - https://lexico.com/es
    Area covered
    Uganda, Seychelles, Central African Republic, Burundi, Israel, Morocco, Spain, Bosnia and Herzegovina, Syrian Arab Republic, Romania
    Description

    EMEA Data Suite offers 43 high-quality language datasets covering 23 languages spoken in the region. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data: headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Sentence Corpora: curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data: native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists: frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.

    If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Arabic Monolingual Dictionary Data: 66,500 words | 98,700 senses | 70,000 example sentences.

    2. Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 example translations.

    3. Arabic Synonyms and Antonyms Data: 55,100 synonyms.

    4. British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.

    7. Catalan Monolingual Dictionary Data: 29,800 words | 47,400 senses | 25,600 example sentences.

    8. Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 example translations.

    9. Croatian Monolingual Dictionary Data: 129,600 words | 164,760 senses | 34,630 example sentences.

    10. Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 example translations.

    11. Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 example translations.

    12. Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 example translations.

    13. French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.

    14. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.

    15. German Monolingual Dictionary Data: 85,500 words | 78,000 senses | 55,000 example sentences.

    16. German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 example translations.

    17. German Word List Data: 338,000 wordforms.

    18. Greek Monolingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 example sentences.

    19. Hebrew Monolingual Dictionary Data: 85,600 words | 104,100 senses | 94,000 example sentences.

    20. Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 example translations.

    21. Hungarian Monolingual Dictionary Data: 90,500 words | 155,300 senses | 42,500 example sentences.

    22. Italian Monolingual Dictionary Data: 102,500 words | 231,580 senses | 48,200 example sentences.

    23. Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 example translations.

    24. Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.

    25. Latvian Monolingual Dictionary Data: 36,000 words | 43,600 senses | 73,600 example sentences.

    26. Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 example translations.

    27. Portuguese Monolingual Dictionary Data: 143,600 words | 285,500 senses | 69,300 example sentences.

    28. Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 example translations.

    29. Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.

    30. Romanian Monolingual Dictionary Data: 66,900 words | 113,500 senses | 2,700 example sentences.

    31. Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 example translations.

    32. Russian Monolingual Dictionary Data: 65,950 words | 57,500 senses | 51,900 example sentences.

    33. Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 example translations.

    34. Slovak Bilingual Dictionary Dat...

  12. Example instance of keyword label consolidation by the Delphi leader before...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Tony Allard; Paul Alvino; Leslie Shing; Allan Wollaber; Joseph Yuen (2023). Example instance of keyword label consolidation by the Delphi leader before instituting the Delphi process. [Dataset]. http://doi.org/10.1371/journal.pone.0211486.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS - http://plos.org/
    Authors
    Tony Allard; Paul Alvino; Leslie Shing; Allan Wollaber; Joseph Yuen
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example instance of keyword label consolidation by the Delphi leader before instituting the Delphi process.

  13. National Energy Efficiency Data-Framework (NEED) data explorer

    • s3.amazonaws.com
    • gov.uk
    Updated Aug 5, 2021
    Cite
    Department for Business, Energy & Industrial Strategy (2021). National Energy Efficiency Data-Framework (NEED) data explorer [Dataset]. https://s3.amazonaws.com/thegovernmentsays-files/content/174/1744763.html
    Explore at:
    Dataset updated
    Aug 5, 2021
    Dataset provided by
    GOV.UK - http://gov.uk/
    Authors
    Department for Business, Energy & Industrial Strategy
    Description

    The data explorer allows users to create bespoke cross tabs and charts on consumption by property attributes and characteristics, based on the data available from NEED. Two variables can be selected at once (for example property age and property type), with mean, median or number of observations shown in the table. There is also a choice of fuel (electricity or gas). The data spans 2007 to 2019.

    Figures provided in the latest version of the tool (June 2021) are based on data used in the June 2021 National Energy Efficiency Data-Framework (NEED) publication. More information on the development of the framework, headline results and data quality are available in the publication. There are also additional detailed tables including distributions of consumption and estimates at local authority level. The data are also available as a comma separated value (csv) file.
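    The same kind of two-variable cross tab can be reproduced offline from the csv release. A minimal sketch, assuming pandas and hypothetical column names in a hypothetical extract file:

```python
# Example of a NEED-style cross tab: median gas consumption by
# property age and property type. Filename and columns are hypothetical.
import pandas as pd

need = pd.read_csv("need_extract.csv")

table = need.pivot_table(index="property_age",
                         columns="property_type",
                         values="gas_consumption_kwh",
                         aggfunc="median")
print(table)
```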

    Error notice: revisions to the June 2021 Domestic NEED annual report

    We identified 2 processing errors in this edition of the Domestic NEED Annual report and corrected them. The changes are small and do not affect the overall findings of the report, only the domestic energy consumption estimates. The impact of energy efficiency measures analysis remains unchanged. The revisions are summarised on the Domestic NEED Report 2021 release page.

    If you have any queries or comments on these outputs please contact: energyefficiency.stats@beis.gov.uk.

    Attachment: NEED data explorer (XLSM, 2.51 MB) - https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1008742/NEED_data_explorer_2021.xlsm (also available to view online).

  14. ZeroCostDL4Mic - pix2pix example training and test dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 14, 2020
    Cite
    Guillaume Jacquemet (2020). ZeroCostDL4Mic - pix2pix example training and test dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3941888
    Explore at:
    Dataset updated
    Jul 14, 2020
    Dataset provided by
    Åbo Akademi University, Turku, Finland
    Authors
    Guillaume Jacquemet
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Name: ZeroCostDL4Mic - pix2pix example training and test dataset

    (see our Wiki for details)

    Data type: Paired microscopy images (fluorescence) of lifeact-RFP and sir-DNA

    Microscopy data type: Fluorescence microscopy

    Microscope: Spinning disk confocal microscope with a 20x 0.8 NA objective

    Cell type: DCIS.COM Lifeact-RFP cells

    File format: .png (RGB images)

    Image size: 1024x1024 (Pixel size: 634 nm)

    Author(s): Guillaume Jacquemet1,2,3

    Contact email: guillaume.jacquemet@abo.fi

    Affiliation(s):

    1) Faculty of Science and Engineering, Cell Biology, Åbo Akademi University, 20520 Turku, Finland

    2) Turku Bioscience Centre, University of Turku and Åbo Akademi University, FI-20520 Turku

    3) ORCID: 0000-0002-9286-920X

    Associated publications: Unpublished

    Funding bodies: G.J. was supported by grants awarded by the Academy of Finland, the Sigrid Juselius Foundation and Åbo Akademi University Research Foundation (CoE CellMech) and by Drug Discovery and Diagnostics strategic funding to Åbo Akademi University.

  15. Teknofest Model Sattelite Data Set Example

    • kaggle.com
    zip
    Updated Jun 4, 2023
    Cite
    Sabri Hakan Demirbaş (2023). Teknofest Model Sattelite Data Set Example [Dataset]. https://www.kaggle.com/datasets/sabrihakandemirba/teknofest-model-sattelite-data-set-example
    Explore at:
    Available download formats: zip (3630 bytes)
    Dataset updated
    Jun 4, 2023
    Authors
    Sabri Hakan Demirbaş
    License

    CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This data set includes the data recorded during the pre-flight tests of the RHEA Model Satellite Team competing in the 2023 Turksat Model Satellite Competition organized by Teknofest.

    Contents

    Some data may be missing or incorrect, either because there was a malfunction, because we could not save the data, or because we could not run the system at all.

    You can access the contest specifications from this link so that you fully understand the task and what the fields mean: https://cdn.teknofest.org/media/upload/userFormUpload/T-MUY_2023_Yar%C4%B1%C5%9Fma_K%C4%B1lavuzu_HE4XU_V4hvY.pdf

    • paket_numarasi - package number. The sequential number assigned to each telemetry packet generated during the competition and sent to the ground station. The first packet starts with "1" and numbering continues sequentially. If the processor restarts, packets should continue from the last number used.

    • uydu_statusu - satellite status. A numeric value indicating the status of the model satellite during the mission. The following statuses must be encoded numerically: 0: Ready-to-Fly (before the rocket is fired); 1: Ascension; 2: Model Satellite Landing; 3: Separation; 4: Payload Landing; 5: Recovery (payload ground contact); 6: Package Video (500 KB) Received; 7: Package Video (500 KB) Sent (bonus task).

    • hata_kodu - error code. A 5-digit telemetry field consisting of 0s and 1s, set according to the specified error conditions.

    • gonderme_saat - send time. Real-time clock data in Day/Month/Year, Hour/Minute/Second format.

    • basinc1 - the atmospheric pressure measured by the sensor on the payload, in Pascal.

    • basinc2 - the atmospheric pressure measured by the sensor on the carrier, in Pascal.

    • yukseklik1 - the height of the payload above the starting point of the flight, in meters. The starting point of the flight should be set to 0 meters.

    • yukseklik2 - the height of the carrier above the starting point of the flight, in meters. The starting point of the flight should be set to 0 meters.

    • irtifa_farki - the absolute difference between yukseklik1 and yukseklik2, in meters.

    • inis_hizi - descent velocity, in m/s.

    • sicaklik - the measured temperature, in degrees C.

    • pil_gerilim - the voltage of the battery, in V.

    • gps_latitude - the latitude of the payload.

    • gps_longitude - the longitude of the payload.

    • gps_altitude - the altitude of the payload reported by GPS.

    • pitch - the tilt angle on the pitch axis, in degrees.

    • yaw - the tilt angle on the yaw axis, in degrees.

    • roll - the tilt angle on the roll axis, in degrees.

    • takim_no - team number. A unique 5-digit number assigned to each team after the application process is completed.

    • video_aktarim_bilgisi - indicates whether the camera video is recorded or not.

    In addition to the dataset, you can also visit the repository below to visualize the data and examine the code of the ground station we used; a minimal parsing sketch follows. https://github.com/SHaken53/Yer_Istasyonu_06
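    A minimal sketch of working with the telemetry columns listed above, assuming pandas and a hypothetical CSV filename; this is an illustration, not the team's actual ground-station code:

```python
# Load a telemetry CSV and use the documented columns.
import pandas as pd

tm = pd.read_csv("flight_test_01.csv")  # hypothetical filename

# irtifa_farki should equal |yukseklik1 - yukseklik2|; check the worst deviation.
diff = (tm["yukseklik1"] - tm["yukseklik2"]).abs()
print((diff - tm["irtifa_farki"]).abs().max())

# Packets where the payload reports ground contact (uydu_statusu == 5).
print(tm.loc[tm["uydu_statusu"] == 5, ["paket_numarasi", "gonderme_saat"]])
```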

  16. Excel, AL Age Group Population Dataset: A Complete Breakdown of Excel Age...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Excel, AL Age Group Population Dataset: A Complete Breakdown of Excel Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/4521c211-f122-11ef-8c1b-3860777c1fe6/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Excel, Alabama
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. For age groups we divided it into roughly a 5 year bucket for ages between 0 and 85. For over 85, we aggregated data into a single group for all ages. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Excel population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total population of Excel. The dataset can be utilized to understand the population distribution of Excel by age. For example, using this dataset, we can identify the largest age group in Excel.

    Key observations

    The largest age group in Excel, AL was the 5 to 9 years group, with a population of 77 (15.28%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Excel, AL was the 85 years and over group, with a population of 2 (0.40%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group under consideration
    • Population: The population for the specific age group in Excel, AL is shown in this column.
    • % of Total Population: This column displays the population of each age group as a proportion of Excel's total population. Please note that the percentages may not total 100% due to rounding. (A small sketch using these columns follows the list.)
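    A minimal sketch of reproducing the "largest age group" observation from the three documented columns, assuming pandas and a hypothetical filename for the downloaded CSV:

```python
# Find the largest age group in the downloaded CSV.
import pandas as pd

age = pd.read_csv("excel-al-population-by-age.csv")  # hypothetical filename

largest = age.loc[age["Population"].idxmax()]
print(largest["Age Group"], largest["Population"], largest["% of Total Population"])
```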

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Excel Population by Age. You can refer to the full dataset here.

  17. Machine Learning Basics for Beginners🤖🧠

    • kaggle.com
    zip
    Updated Jun 22, 2023
    Cite
    Bhanupratap Biswas (2023). Machine Learning Basics for Beginners🤖🧠 [Dataset]. https://www.kaggle.com/datasets/bhanupratapbiswas/machine-learning-basics-for-beginners
    Explore at:
    Available download formats: zip (492015 bytes)
    Dataset updated
    Jun 22, 2023
    Authors
    Bhanupratap Biswas
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 - http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:

    1. Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.

    2. Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.

    3. Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.

    4. Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).

    5. Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly identified), and the F1 score (the harmonic mean of precision and recall). A short code sketch after this list puts these metrics into practice.

    6. Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.

    7. Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.

    8. Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.

    9. Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.

    10. Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.

    These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
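
    As a rough illustration of concepts 3, 5, and 9, the sketch below splits a toy dataset, fits a logistic regression classifier, and reports the metrics defined above. The synthetic data and the scikit-learn usage are illustrative assumptions, not part of this Kaggle upload.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Synthetic binary-classification data stands in for a real labeled dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # Concept 3: hold out test data to measure generalization
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Concept 9: a simple supervised learning algorithm
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Concept 5: common evaluation metrics
    print("accuracy: ", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))
    print("F1 score: ", f1_score(y_test, y_pred))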

  18. 54k Resume dataset (structured)

    • kaggle.com
    zip
    Updated Nov 14, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Suriya Ganesh (2024). 54k Resume dataset (structured) [Dataset]. https://www.kaggle.com/datasets/suriyaganesh/resume-dataset-structured
    Explore at:
    zip(39830263 bytes)Available download formats
    Dataset updated
    Nov 14, 2024
    Authors
    Suriya Ganesh
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is aggregated from sources that are entirely available in the public domain.

    Resumes are usually distributed as PDFs. OCR was used to convert the PDFs into text, and LLMs were used to convert the extracted text into a structured format.

    Dataset Overview

    This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.

    Table Schemas

    1. people.csv

    Primary table containing core information about each individual.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Unique identifier for each person | Primary Key, Not Null | 1
    name | VARCHAR(255) | Full name of the person | May be Null | "Database Administrator"
    email | VARCHAR(255) | Email address | May be Null | "john.doe@email.com"
    phone | VARCHAR(50) | Contact number | May be Null | "+1-555-0123"
    linkedin | VARCHAR(255) | LinkedIn profile URL | May be Null | "linkedin.com/in/johndoe"

    2. abilities.csv

    Detailed abilities and competencies listed by individuals.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    ability | TEXT | Description of ability | Not Null | "Installation and Building Server"

    3. education.csv

    Contains educational history for each person.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    institution | VARCHAR(255) | Name of educational institution | May be Null | "Lead City University"
    program | VARCHAR(255) | Degree or program name | May be Null | "Bachelor of Science"
    start_date | VARCHAR(7) | Start date of education | May be Null | "07/2013"
    location | VARCHAR(255) | Location of institution | May be Null | "Atlanta, GA"

    4. experience.csv

    Details of work experience entries.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    title | VARCHAR(255) | Job title | May be Null | "Database Administrator"
    firm | VARCHAR(255) | Company name | May be Null | "Family Private Care LLC"
    start_date | VARCHAR(7) | Employment start date | May be Null | "04/2017"
    end_date | VARCHAR(7) | Employment end date | May be Null | "Present"
    location | VARCHAR(255) | Job location | May be Null | "Roswell, GA"

    5. person_skills.csv

    Mapping table connecting people to their skills.

    Column Name | Data Type | Description | Constraints | Example
    person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1
    skill | VARCHAR(255) | Reference to skills table | Foreign Key, Not Null | "SQL Server"

    6. skills.csv

    Master list of unique skills mentioned across all resumes.

    Column Name | Data Type | Description | Constraints | Example
    skill | VARCHAR(255) | Unique skill name | Primary Key, Not Null | "SQL Server"

    Relationships

    • Each person (people.csv) can have:
      • Multiple education entries (education.csv)
      • Multiple experience entries (experience.csv)
      • Multiple skills (person_skills.csv)
      • Multiple abilities (abilities.csv)
    • Skills (skills.csv) can be associated with multiple people
    • All relationships are maintained through the person_id field
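
    A minimal pandas sketch of these joins (file and column names are taken from the schemas above; nothing else is assumed):

    import pandas as pd

    # Core table plus two child tables
    people = pd.read_csv("people.csv")
    person_skills = pd.read_csv("person_skills.csv")
    experience = pd.read_csv("experience.csv")

    # All relationships are maintained through person_id
    people_with_skills = people.merge(person_skills, on="person_id", how="left")
    people_with_jobs = people.merge(experience, on="person_id", how="left")

    # Example: number of distinct skills listed per person
    print(person_skills.groupby("person_id")["skill"].nunique().head())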

    Data Characteristics

    Date Formats

    • All dates are stored in MM/YYYY format
    • Current positions use "Present" for end_date
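
    For example, the MM/YYYY strings can be parsed with pandas as sketched below ("Present" simply becomes a missing value here, which you may prefer to fill with today's date):

    import pandas as pd

    exp = pd.read_csv("experience.csv")

    # Parse MM/YYYY strings; "Present" (and any malformed value) coerces to NaT
    exp["start"] = pd.to_datetime(exp["start_date"], format="%m/%Y", errors="coerce")
    exp["end"] = pd.to_datetime(exp["end_date"], format="%m/%Y", errors="coerce")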

    Text Fields

    • All text fields preserve original case
    • NULL values indicate missing information
    • No maximum length enforced for TEXT fields
    • VARCHAR fields have practical limits noted in schema

    Identifiers

    • person_id starts at 1 and increments sequentially
    • No natural or composite keys used
    • All relationships maintained through person_id

    Common Usage Patterns

    Basic Queries

    -- Get all skills for a person
    SELECT s.skill 
    FROM person_skills ps
    JOIN skills s ON ps.skill = s.skill
    WHERE ps.person_id = 1;
    
    -- Get complete work history
    SELECT * 
    FROM experience
    WHERE person_id = 1
    ORDER BY start_date DESC;
    

    Analytics Queries

    -- Most common skills
    SELECT s.skill, COUNT(*) AS frequency
    FROM person_skills ps
    JOIN skills s ON ps.skill = s.skill
    GROUP BY s.skill
    ORDER BY frequency DESC;
    
  19. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Explore at:
    zip(9789538 bytes)Available download formats
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
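
    As a baseline for the classification task, here is a minimal sketch using TF-IDF features and logistic regression (the filename company_documents.csv and the text/label column names are placeholders for the actual CSV shipped with this dataset):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Placeholder filename and column names -- adjust to the CSV provided in the dataset
    df = pd.read_csv("company_documents.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42)

    # TF-IDF features plus a linear classifier make a simple, strong baseline
    vectorizer = TfidfVectorizer(max_features=20000)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
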
  20. Colon-Cancer-datasets

    • kaggle.com
    zip
    Updated Jun 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Apn_Gupta (2025). Colon-Cancer-datasets [Dataset]. https://www.kaggle.com/datasets/apngupta/colon-cancer-datasets
    Explore at:
    zip(235188821 bytes)Available download formats
    Dataset updated
    Jun 20, 2025
    Authors
    Apn_Gupta
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🧬 Colon Cancer Histopathology Dataset

    This dataset contains histopathological image data for the identification of colon cancer using deep learning. It includes high-resolution images labeled as cancerous or non-cancerous, intended for training and validating computer vision models in medical imaging.

    📁 Dataset Structure

    The dataset is organised into two main image folders and two supporting CSV files:

    ├── train/       # 7,560 labelled images for training
    ├── test/        # 5,041 unlabeled images for inference/testing
    ├── train.csv      # Contains image filenames and corresponding labels (for train/ folder)
    ├── example.csv     # Sample format for custom data input
    

    📊 Description

    Folder/File | Description
    train/ | Contains labeled histopathology images
    test/ | Contains images without labels for model inference
    train.csv | CSV file with two columns: image_id, label
    example.csv | A demonstration CSV with the expected structure
    • Label Encoding:

      • Id → The Id of the Image
      • Type → Cancer / Connective / Immune / Normal

    💡 Usage Example

    Load the training labels:

    import pandas as pd
    df = pd.read_csv("train.csv")
    print(df.head())
    

    Read an image:

    from PIL import Image
    img = Image.open("train/image_00123.jpg")
    img.show()
    

    📦 Intended Use

    • 🔍 Research in medical imaging and digital pathology
    • 🧠 Training deep learning models (CNNs, transfer learning)
    • 🧪 Educational purposes for learning supervised image classification
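
    A rough transfer-learning starting point for the CNN use case (PyTorch and torchvision >= 0.13 are assumptions here, not something the dataset prescribes; the four output classes follow the label encoding above):

    import torch.nn as nn
    from torchvision import models, transforms

    # Standard ImageNet preprocessing for a pretrained backbone
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Replace the classifier head with a 4-class output:
    # Cancer / Connective / Immune / Normal
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 4)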

    ⚠️ Licensing & Ethics

    • Please ensure ethical use, especially in any clinical or diagnostic context.
    • Dataset is for educational and research purposes only.
    • Source data must be anonymised and not traceable to patients.

    🙋‍♂️ Contact & Attribution

    Uploaded by: Arpan Gupta
    Full project using this dataset: GitHub Repo
    Notebook Using Dataset: Kaggle
