Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of a CSV file containing 300 generated email spam messages. Each row in the file represents a separate email message, with its title and text. The dataset aims to facilitate the analysis and detection of spam emails.
The dataset can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.
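As an illustration of the text-based feature analysis mentioned above, here is a minimal Python sketch that reads rows with a title and a text field and computes a few simple features. The column names `title` and `text` and the two sample rows are assumptions made for illustration; check the header of the real CSV file before reusing this.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file. The column names
# "title" and "text" are assumptions, not confirmed by the dataset page.
SAMPLE_CSV = """title,text
WIN A FREE PRIZE!!!,Click now to claim your reward. Offer ends today!
Account notice,Please verify your details at the link below.
"""

def text_features(title: str, text: str) -> dict:
    """Compute a few simple text-based features for one message."""
    letters = [c for c in title if c.isalpha()]
    upper_ratio = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    return {
        "title_upper_ratio": round(upper_ratio, 2),
        "exclamations": (title + text).count("!"),
        "word_count": len(text.split()),
    }

def load_features(csv_file) -> list[dict]:
    """Read messages from an open CSV file and compute features per row."""
    reader = csv.DictReader(csv_file)
    return [text_features(row["title"], row["text"]) for row in reader]

features = load_features(io.StringIO(SAMPLE_CSV))
for row in features:
    print(row)
```

To run it on the actual file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.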
The data was generated with the text-davinci-003 model through the OpenAI API.
includes the following information:
keywords: spam mails dataset, email spam classification, spam or not-spam, spam e-mail database, spam detection system, email spamming data set, spam filtering system, spambase, feature extraction, spam ham email dataset, classifier, machine learning algorithms, automated, generated data, synthetic data, synthetic data generation, synthetic dataset, cybersecurity, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A company shareholder email list is a database of contact information for people who own shares in a company. It is a key tool for investor relations and often a legal requirement for public companies. It lets a company send important updates directly to shareholders, including annual reports, news releases, and announcements about meetings or dividends. The list helps a company build trust, maintain a strong relationship with its investors, and keep them informed about its performance and future. For example, when a company has news to share, such as how much money it made, it can use the list to send the message directly to shareholders.
A company shareholder email list also makes a company's messages more relevant and effective. A company can use the data to personalize messages, for example based on how many shares a person owns. Public companies have legal requirements to communicate with their shareholders, and the list helps them meet those obligations efficiently. In short, the list helps manage investor relationships, supports compliance, and keeps shareholders engaged. List to Data is a great tool for growing your business. The company shareholder email database is a special version of the shareholder dataset that focuses on the most active and engaged shareholders: people who might go to meetings, vote on important decisions, or talk to the company's leaders. It also makes it easier for companies to send the right messages to the right people. For example, if a company knows where shareholders live, it can send them more relevant messages.
The company shareholder email database helps companies communicate better with their shareholders. For example, a company can use it to share news, tell shareholders about a new product, or ask them to vote on important matters.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains information about 200 students with 7 attributes. It is designed for use in exploratory data analysis, SQL practice, or machine learning experiments related to student performance or demographics.
StudentID: Unique identifier for each student. Type: Integer (4-digit unique number). Example: 1023, 3045.
Name: Full name of the student. Type: String (Text). Example: John Doe, Alice Smith.
Age: Age of the student. Type: Integer (Range: 18–25). Example: 21, 19.
Email: Email address of the student. Type: String (Unique). Example: john.doe@example.com, alice.smith@example.org.
Department: Department or major of the student. Type: String (Categorical). Categories: Computer Science, Biology, Mathematics, Physics, Chemistry. Example: Biology, Computer Science.
GPA: Grade Point Average (GPA) of the student (on a 4.0 scale). Type: Float (2 decimal points). Range: 2.0–4.0. Example: 3.45, 3.89.
GraduationYear: Expected graduation year of the student. Type: Integer. Range: 2024–2030. Example: 2025, 2028.
Data characteristics: Rows: 200 (one for each student). Columns: 7 (attributes described above).
Unique Values: Each StudentID and Email is unique. The Department column contains 5 distinct values.
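The column rules above translate directly into checks. The following Python sketch is one possible way to validate records against them; the two sample records reuse the example values from the description, the helper names are hypothetical, and reading "4-digit" as the range 1000 to 9999 is an assumption.

```python
DEPARTMENTS = {"Computer Science", "Biology", "Mathematics", "Physics", "Chemistry"}

def validate_student(row: dict) -> list[str]:
    """Return a list of rule violations for one student record."""
    errors = []
    # Assumption: "4-digit" means 1000..9999 (no leading zeros).
    if not 1000 <= row["StudentID"] <= 9999:
        errors.append("StudentID must be a 4-digit integer")
    if not 18 <= row["Age"] <= 25:
        errors.append("Age must be between 18 and 25")
    if row["Department"] not in DEPARTMENTS:
        errors.append("Department must be one of the 5 listed categories")
    if not 2.0 <= row["GPA"] <= 4.0:
        errors.append("GPA must be between 2.0 and 4.0")
    if not 2024 <= row["GraduationYear"] <= 2030:
        errors.append("GraduationYear must be between 2024 and 2030")
    return errors

def duplicated_values(rows: list[dict], column: str) -> set:
    """Return values that appear more than once in a column meant to be unique."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[column]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates

# Hypothetical records built from the example values in the description.
students = [
    {"StudentID": 1023, "Name": "John Doe", "Age": 21,
     "Email": "john.doe@example.com", "Department": "Biology",
     "GPA": 3.45, "GraduationYear": 2025},
    {"StudentID": 3045, "Name": "Alice Smith", "Age": 19,
     "Email": "alice.smith@example.org", "Department": "Computer Science",
     "GPA": 3.89, "GraduationYear": 2028},
]

for student in students:
    print(student["StudentID"], validate_student(student))
print("duplicate emails:", duplicated_values(students, "Email"))
```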
The study was conducted in Serbia between October 2008 and February 2009 as part of the first round of The Management, Organization and Innovation Survey. Data from 135 manufacturing companies with 50 to 5,000 full-time employees was analyzed.
The survey topics include detailed information about a company and its management practices - production performance indicators, production target, ways employees are promoted/dealt with when underperforming. The study also focuses on organizational matters, innovation, spending on research and development, production outsourcing to other countries, competition, and workforce composition.
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment is defined as a separate production unit, regardless of whether or not it has its own financial statements separate from those of the firm, and whether it has its own management and control over payroll. So the bottling plant of a brewery would be counted as an establishment.
The survey universe was defined as manufacturing establishments with at least 50, but fewer than 5,000, full-time employees.
Sample survey data [ssd]
Random sampling was used in the study. For all MOI countries except Russia, all regions had to be covered, and the percentage of the sample in each region had to equal at least one half of the percentage of the sample frame population in that region.
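The regional requirement can be expressed as a simple check: a region passes if it is present in the sample and its share of the sample is at least half of its share of the sample frame. The Python sketch below illustrates this under stated assumptions; the region names and counts are hypothetical and the function name is not taken from the survey documentation.

```python
def check_regional_coverage(sample_counts: dict, frame_counts: dict) -> dict:
    """Return, per region, whether the coverage requirement is met.

    A region passes if it has at least one sampled establishment and its
    share of the sample is at least half of its share of the sample frame.
    """
    total_sample = sum(sample_counts.values())
    total_frame = sum(frame_counts.values())
    result = {}
    for region, frame_n in frame_counts.items():
        sample_n = sample_counts.get(region, 0)
        sample_share = sample_n / total_sample
        frame_share = frame_n / total_frame
        result[region] = sample_n > 0 and sample_share >= 0.5 * frame_share
    return result

# Hypothetical counts of establishments per region.
frame = {"North": 400, "South": 300, "East": 300}
sample = {"North": 60, "South": 35, "East": 5}

print(check_regional_coverage(sample, frame))
```

In this made-up example the East region fails: it holds 30% of the frame, so it would need at least 15% of the sample, but it has only 5%.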
In most countries the sample frame used was an extract from the Orbis database of Bureau van Dijk, which was provided to the Consultant by the EBRD. The sample frame contained details of company names, location, company size (number of employees), company performance measures and contact details. The sample frame downloaded from Orbis was cleaned by the EBRD through the addition of regional variables, updating addresses and phone numbers of companies.
Examination of the Orbis sample frames showed their geographic distributions to be wide with many locations, a large number of which had only a small number of records. Each establishment was selected with two substitutes that can be used if it proves impossible to conduct an interview at the first establishment. In practice selection was confined to locations with the most records in the sample frame, so the sample frame was filtered to just the cities with the most establishments.
The quality of the frame was assessed at the onset of the project. The frame proved to be useful though it showed positive rates of non-eligibility, repetition, non-existent units, etc. These problems are typical of establishment surveys. For Serbia, the percentage of confirmed non-eligible units as a proportion of the total number of contacts to complete the survey was 26.7% (82 out of 307 establishments).
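The quoted percentage follows directly from the two counts. A short Python check, using only the figures stated above (the function name is illustrative):

```python
def non_eligible_rate(non_eligible: int, contacts: int) -> float:
    """Percentage of confirmed non-eligible units among all contacts."""
    return round(100 * non_eligible / contacts, 1)

# Serbia figures quoted above: 82 non-eligible out of 307 contacted.
serbia_rate = non_eligible_rate(82, 307)
print(serbia_rate)  # 26.7
```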
Face-to-face [f2f]
Two different versions of the questionnaire were used. Questionnaire A was used when interviewing establishments that are part of multi-establishment firms, while Questionnaire B was used when interviewing single-establishment firms. Questionnaire A incorporates all questions from Questionnaire B; the only difference is in the reference point, which is the so-called national firm in the first part of Questionnaire A and the firm in Questionnaire B. The second part of the questionnaire refers only to the interviewed establishment in both Questionnaire A and Questionnaire B. Each variation of the questionnaire is identified by the index variable, a0.
Item non-response was addressed by two strategies:
- For sensitive questions that may generate negative reactions from the respondent, such as ownership information, enumerators were instructed to record the refusal to respond as (-8).
- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary.
However, there were clear cases of low response.
Survey non-response was addressed by maximising efforts to contact establishments that were initially selected for interviews. Up to 15 attempts (but at least 4 attempts) were made to contact an establishment for interview at different times/days of the week before a replacement establishment (with similar characteristics) was suggested for interview. Survey non-response did occur, but substitutions were made in order to potentially achieve the goals.
Additional information about sampling, response rates and survey implementation can be found in "MOI Survey Report on Methodology and Observations 2009" in the "Technical Documents" folder.
Comprehensive Chinese language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details, covering the Simplified and Traditional writing systems.
Our Chinese language datasets are carefully compiled and annotated by language and linguistic experts. The following datasets are available for license:
Key Features (approximate numbers):
Our Mandarin Chinese (simplified) monolingual dataset features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language spoken today.
Our Mandarin Chinese (traditional) monolingual dataset features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language spoken today.
The bilingual data provides translations in both directions, from English to Mandarin Chinese (simplified) and from Mandarin Chinese (simplified) to English. It is reviewed and updated annually by our in-house team of language experts. It offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.
The bilingual data provides translations in both directions, from English to Mandarin Chinese (traditional) and from Mandarin Chinese (traditional) to English. It is reviewed and updated annually by our in-house team of language experts. It offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.
The Mandarin Chinese (simplified) Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary Mandarin Chinese. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
Please note that some datasets may have rights restrictions. Contact us for more information.
About the sample:
The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.
If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
Introduction: The Randomized Aldactone Evaluation Study (RALES) randomized 822 patients to receive 25 mg spironolactone daily and 841 to receive placebo. The primary endpoint was death from all causes. Randomization began on March 24, 1995; recruitment was completed on December 31, 1996; follow-up was scheduled to continue through December 31, 1999. Evidence of a sizeable benefit on mortality emerged early in the RALES. The RALES data safety monitoring board (DSMB), which met semiannually throughout the trial, used a prespecified statistical guideline to recommend stopping for efficacy. At the DSMB's request, its meetings were preceded by an 'endpoint sweep', that is, a census of all participants to confirm their vital status.
Methods: We used computer simulation to evaluate the effect of the sweeps.
Results: The sweeps led to an estimated 5 to 8% increase in the number of reported deaths at the fourth and fifth interim analyses. The data crossed the statistical boundary at the fifth interim analysis. If investigators had reported all deaths within the protocol-required 24-h window, the DSMB might have recommended stopping after the fourth interim analysis.
Discussion: Although endpoint sweeps can cause practical problems at the clinical centers, sweeps are very useful if the intervals between patient visits or contact are long or if endpoints require adjudication by committee, reading center, or central laboratory.
Conclusion: We recommend that trials with interim analyses institute active reporting of the primary endpoints and endpoint sweeps.
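To make the idea of simulating endpoint sweeps concrete, here is a toy Python sketch. It is not the authors' simulation: the uniform death times, the exponentially distributed reporting lag, and all parameter values are assumptions chosen only for illustration. It compares how many deaths are known at an interim analysis when reports arrive with a delay against a sweep that confirms every participant's vital status.

```python
import random

def simulate_reported_deaths(n_deaths, analysis_day, mean_lag_days, seed=0):
    """Count deaths known at an interim analysis without and with a sweep.

    Toy assumptions: deaths occur uniformly before the analysis day, and
    each report arrives after an exponentially distributed lag.
    """
    rng = random.Random(seed)
    without_sweep = 0
    for _ in range(n_deaths):
        death_day = rng.uniform(0, analysis_day)
        lag = rng.expovariate(1 / mean_lag_days)
        if death_day + lag <= analysis_day:
            without_sweep += 1
    # A sweep confirms vital status for everyone, so all deaths are counted.
    with_sweep = n_deaths
    return without_sweep, with_sweep

# Hypothetical parameter values, not taken from RALES.
without_sweep, with_sweep = simulate_reported_deaths(
    n_deaths=200, analysis_day=180, mean_lag_days=10, seed=1
)
if without_sweep:
    increase = 100 * (with_sweep - without_sweep) / without_sweep
    print(f"{without_sweep} vs {with_sweep} deaths, {increase:.1f}% increase")
```

Under these assumptions, the size of the increase depends mainly on the ratio of the mean reporting lag to the time between analyses.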
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Name: ZeroCostDL4Mic - YoloV2 example training and test dataset
(see our Wiki for details)
Data type: 2D grayscale .png images with corresponding bounding box annotations in .xml PASCAL VOC format.
Microscopy data type: Phase contrast microscopy data (brightfield)
Microscope: Inverted Zeiss Axio zoom widefield microscope equipped with an AxioCam MRm camera, an EL Plan-Neofluar 20×/0.5 NA objective (Carl Zeiss), with a heated chamber (37 °C) and a CO2 controller (5%).
Cell type: MDA-MB-231 cells migrating on cell-derived matrices generated by fibroblasts.
File format: .png (8-bit)
Image size: 1388 x 1040 px (Pixel size: 323 nm)
Author(s): Guillaume Jacquemet1,2,3, Lucas von Chamier4,5
Contact email: lucas.chamier.13@ucl.ac.uk and guillaume.jacquemet@abo.fi
Affiliation(s):
1) Faculty of Science and Engineering, Cell Biology, Åbo Akademi University, 20520 Turku, Finland
2) Turku Bioscience Centre, University of Turku and Åbo Akademi University, FI-20520 Turku
3) ORCID: 0000-0002-9286-920X
4) MRC-Laboratory for Molecular Cell Biology. University College London, London, UK
5) ORCID: 0000-0002-9243-912X
Associated publications: Jacquemet et al 2016. DOI: 10.1038/ncomms13297
Funding bodies: G.J. was supported by grants awarded by the Academy of Finland, the Sigrid Juselius Foundation and Åbo Akademi University Research Foundation (CoE CellMech) and by Drug Discovery and Diagnostics strategic funding to Åbo Akademi University.
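Bounding boxes in PASCAL VOC format are stored as `object` elements with a `name` and a `bndbox` holding `xmin`, `ymin`, `xmax` and `ymax`. The following Python sketch shows one way to read them with the standard library. The sample XML, including the file name and the `cell` label, is a hypothetical annotation written for illustration and is not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout; the file name, label and
# coordinates are invented for illustration.
SAMPLE_XML = """<annotation>
  <filename>example.png</filename>
  <size><width>1388</width><height>1040</height><depth>1</depth></size>
  <object>
    <name>cell</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>180</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>"""

def read_voc_boxes(xml_text: str) -> list[dict]:
    """Extract the label and pixel coordinates of every annotated object."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        boxes.append({
            "label": obj.findtext("name"),
            "xmin": int(bndbox.findtext("xmin")),
            "ymin": int(bndbox.findtext("ymin")),
            "xmax": int(bndbox.findtext("xmax")),
            "ymax": int(bndbox.findtext("ymax")),
        })
    return boxes

boxes = read_voc_boxes(SAMPLE_XML)
print(boxes)
```

For annotation files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`, where `path` is the location of one .xml file.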
Background: The COVID-19 pandemic changed the way many industries work, including contact centres, with some employees working from home and new on-site restrictions/measures in place representing even greater challenges for employers around staff engagement and wellbeing. This study aimed to understand the interplay of individual, social, environmental and organisational factors influencing physical activity and sedentary behaviour in UK contact centre employees, how the pandemic impacted these factors, and their relevance for the future of hybrid working.
Methods: Individual interviews (n = 33) were conducted with participants (staff working full and part time, on site and from home) from four UK contact centres. A topic guide based on the ecological model was developed to understand current barriers and facilitators to physical activity and (reducing) sedentary behaviour during and outside of working hours. Thematic analysis was carried out using a codebook and a deductive coding approach to identify themes.
Results: Three key insights are provided. First, participants felt they were generally sitting more and moving less since the first UK-wide lockdown. Second, factors which negatively impacted on these behaviours were evident across all levels of the ecological model. These included individual and social barriers (e.g., lack of motivation and preferable physical activity options) as well as environmental and organisational barriers (e.g., poor home office setup, back-to-back virtual meetings). There was a mix of new and existing barriers (exacerbated by the pandemic), and several of these were linked to homeworking. Third, organisational support requirements (e.g., homeworking ergonomic support) and existing facilitators (such as the provision of informational support and flexible working arrangements) were identified.
Conclusion: Solutions to reduce sedentary behaviours and increase physical activity in contact centres need to address barriers from the individual to the organisational level. Whilst the study was undertaken in the UK, the results are likely to be applicable globally.
Trial registration: The trial for the wider project has been registered on the ISRCTN database: http://www.isrctn.com/ISRCTN11580369.
The study was conducted in Belarus between October 2008 and February 2009 as part of the first round of The Management, Organization and Innovation Survey. Data from 102 manufacturing companies with 50 to 5,000 full-time employees was analyzed.
The survey topics include detailed information about a company and its management practices - production performance indicators, production target, ways employees are promoted/dealt with when underperforming. The study also focuses on organizational matters, innovation, spending on research and development, production outsourcing to other countries, competition, and workforce composition.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment is defined as a separate production unit, regardless of whether or not it has its own financial statements separate from those of the firm, and whether it has its own management and control over payroll. So the bottling plant of a brewery would be counted as an establishment.
The survey universe was defined as manufacturing establishments with at least 50, but fewer than 5,000, full-time employees.
Sample survey data [ssd]
Random sampling was used in the study. For all MOI countries except Russia, all regions had to be covered, and the percentage of the sample in each region had to equal at least one half of the percentage of the sample frame population in that region.
In most countries the sample frame used was an extract from the Orbis database of Bureau van Dijk, which was provided to the Consultant by the EBRD. The sample frame contained details of company names, location, company size (number of employees), company performance measures and contact details. The sample frame downloaded from Orbis was cleaned by the EBRD through the addition of regional variables, updating addresses and phone numbers of companies.
Examination of the Orbis sample frames showed their geographic distributions to be wide with many locations, a large number of which had only a small number of records. Each establishment was selected with two substitutes that can be used if it proves impossible to conduct an interview at the first establishment. In practice selection was confined to locations with the most records in the sample frame, so the sample frame was filtered to just the cities with the most establishments.
The quality of the frame was assessed at the onset of the project. The frame proved to be useful though it showed positive rates of non-eligibility, repetition, non-existent units, etc. These problems are typical of establishment surveys. For Belarus, the percentage of confirmed non-eligible units as a proportion of the total number of contacts to complete the survey was 30.6% (83 out of 271 establishments).
Face-to-face [f2f]
Two different versions of the questionnaire were used. Questionnaire A was used when interviewing establishments that are part of multi-establishment firms, while Questionnaire B was used when interviewing single-establishment firms. Questionnaire A incorporates all questions from Questionnaire B; the only difference is in the reference point, which is the so-called national firm in the first part of Questionnaire A and the firm in Questionnaire B. The second part of the questionnaire refers only to the interviewed establishment in both Questionnaire A and Questionnaire B. Each variation of the questionnaire is identified by the index variable, a0.
Item non-response was addressed by two strategies:
- For sensitive questions that may generate negative reactions from the respondent, such as ownership information, enumerators were instructed to record the refusal to respond as (-8).
- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary.
However, there were clear cases of low response.
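When analysing the data, the refusal code (-8) should be kept apart from both valid answers and blank values, otherwise it can distort means and counts. The Python sketch below shows one way to tabulate item non-response for a variable. The variable name `ownership_share` and the sample records are hypothetical, and representing a blank value as `None` is an assumption about how the data were loaded.

```python
REFUSAL_CODE = -8  # code recorded by enumerators for a refusal to respond

def item_nonresponse(records: list[dict], variable: str) -> dict:
    """Count answered, refused and missing values for one survey variable."""
    refused = sum(1 for r in records if r.get(variable) == REFUSAL_CODE)
    # Assumption: blank values were loaded as None (or the key is absent).
    missing = sum(1 for r in records if r.get(variable) is None)
    answered = len(records) - refused - missing
    return {"answered": answered, "refused": refused, "missing": missing}

# Hypothetical records; "ownership_share" is an illustrative variable name.
records = [
    {"a0": 1, "ownership_share": 60},
    {"a0": 2, "ownership_share": -8},
    {"a0": 2, "ownership_share": None},
    {"a0": 1, "ownership_share": 100},
]

summary = item_nonresponse(records, "ownership_share")
print(summary)
```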
Survey non-response was addressed by maximising efforts to contact establishments that were initially selected for interviews. Up to 15 attempts (but at least 4 attempts) were made to contact an establishment for interview at different times/days of the week before a replacement establishment (with similar characteristics) was suggested for interview. Survey non-response did occur, but substitutions were made in order to potentially achieve the goals.
Additional information about sampling, response rates and survey implementation can be found in "MOI Survey Report on Methodology and Observations 2009" in the "Technical Documents" folder.
Dataset Overview:
1. First volume of Reconstructed Sequences (located at subfolder "Representative Figures Data")
Data type: 3D reconstruction (first volume) of mitochondria–lysosome interaction
File format: .tif (8-bit)
Image size: 1890x1380x161 px³ for mitochondria and lysosome, 1500x1830x101 px³ for peroxisome (Properties: 0.054x0.054x0.054 µm³)
2. Example Training Data for Alpha-LFM Net (located at subfolder "Example_Training_Data")
Data type: Lysosome training images (~2000 pairs) used for Alpha-Net training in light-field microscopy
File format: .tif (8-bit)
Image size: 480x480 px² for LF projections (Properties: 0.108x0.108 µm²), 480x480x161 px³ for 3D volumes (Properties: 0.054x0.054x0.054 µm³)
Authors: Lanxin Zhu, Jiahao Sun, Chengqiang Yi, Meng Zhang
Contact email: lanxinzhu@hust.edu.cn
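As a worked example, the physical extent of a volume is its size in voxels multiplied by the voxel size along each axis. The Python sketch below does this for the first reconstructed volume, assuming that the "Properties" values are the voxel dimensions in micrometres; that reading is an assumption, not something the description states explicitly.

```python
def physical_extent_um(shape_px, voxel_size_um):
    """Multiply each axis length in voxels by the voxel size along that axis."""
    return tuple(round(n * s, 3) for n, s in zip(shape_px, voxel_size_um))

# Mitochondria and lysosome volume from the description above.
# Assumption: "Properties: 0.054x0.054x0.054" is the voxel size in micrometres.
extent = physical_extent_um((1890, 1380, 161), (0.054, 0.054, 0.054))
print(extent)  # (102.06, 74.52, 8.694)
```

Under that assumption, the volume spans roughly 102 x 75 x 8.7 micrometres.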
EMEA Data Suite offers 43 high-quality language datasets covering 23 languages spoken in the region. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.
Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
Monolingual and Bilingual Dictionary Data: Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
Sentence Corpora: Curated examples of real-world usage with contextual annotations for training and evaluation.
Synonyms & Antonyms: Lexical relations to support semantic search, paraphrasing, and language understanding.
Audio Data: Native speaker recordings for speech recognition, TTS, and pronunciation modeling.
Word Lists: Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.
Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.
Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.
Arabic Monolingual Dictionary Data: 66,500 words | 98,700 senses | 70,000 example sentences.
Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 example translations.
Arabic Synonyms and Antonyms Data: 55,100 synonyms.
British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.
British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.
British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.
Catalan Monolingual Dictionary Data: 29,800 words | 47,400 senses | 25,600 example sentences.
Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 example translations.
Croatian Monolingual Dictionary Data: 129,600 words | 164,760 senses | 34,630 example sentences.
Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 example translations.
Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 example translations.
Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 example translations.
French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.
French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.
German Monolingual Dictionary Data: 85,500 words | 78,000 senses | 55,000 example sentences.
German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 example translations.
German Word List Data: 338,000 wordforms.
Greek Monolingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 example sentences.
Hebrew Monolingual Dictionary Data: 85,600 words | 104,100 senses | 94,000 example sentences.
Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 example translations.
Hungarian Monolingual Dictionary Data: 90,500 words | 155,300 senses | 42,500 example sentences.
Italian Monolingual Dictionary Data: 102,500 words | 231,580 senses | 48,200 example sentences.
Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 example translations.
Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.
Latvian Monolingual Dictionary Data: 36,000 words | 43,600 senses | 73,600 example sentences.
Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 example translations.
Portuguese Monolingual Dictionary Data: 143,600 words | 285,500 senses | 69,300 example sentences.
Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 example translations.
Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.
Romanian Monolingual Dictionary Data: 66,900 words | 113,500 senses | 2,700 example sentences.
Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 example translations.
Russian Monolingual Dictionary Data: 65,950 words | 57,500 senses | 51,900 example sentences.
Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 example translations.
Slovak Bilingual Dictionary Dat...
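The catalog entries above follow a regular "name: count label | count label" pattern, so they can be turned into structured records for comparison across languages. The Python sketch below is one way to do this; the function name is hypothetical and the parser assumes every metric starts with a comma-grouped number.

```python
import re

def parse_catalog_line(line: str) -> tuple[str, dict]:
    """Split 'Name: 66,500 words | 98,700 senses.' into a name and metrics."""
    name, _, stats = line.partition(":")
    metrics = {}
    for part in stats.split("|"):
        # A number with thousands separators, then a label, then an
        # optional trailing period.
        match = re.match(r"\s*([\d,]+)\s+(.+?)\.?\s*$", part)
        if match:
            metrics[match.group(2)] = int(match.group(1).replace(",", ""))
    return name.strip(), metrics

line = ("Arabic Monolingual Dictionary Data: 66,500 words | "
        "98,700 senses | 70,000 example sentences.")
name, metrics = parse_catalog_line(line)
print(name, metrics)
```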
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example instance of keyword label consolidation by the Delphi leader before instituting the Delphi process.
The data explorer allows users to create bespoke cross tabs and charts on consumption by property attributes and characteristics, based on the data available from NEED. Two variables can be selected at once (for example property age and property type), with mean, median or number of observations shown in the table. There is also a choice of fuel (electricity or gas). The data spans 2007 to 2019.
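The kind of cross tab described above can be sketched in a few lines of Python: group records by two variables and report the mean, median or number of observations in each cell. The records, column names and consumption values below are hypothetical and do not come from NEED.

```python
from collections import defaultdict
from statistics import mean, median

def cross_tab(records, row_var, col_var, value_var, stat="mean"):
    """Summarise value_var for every (row_var, col_var) combination."""
    groups = defaultdict(list)
    for record in records:
        groups[(record[row_var], record[col_var])].append(record[value_var])
    functions = {"mean": mean, "median": median, "count": len}
    return {key: functions[stat](values) for key, values in groups.items()}

# Hypothetical households; the figures are invented for illustration.
households = [
    {"property_age": "pre-1930", "property_type": "detached", "gas_kwh": 18000},
    {"property_age": "pre-1930", "property_type": "detached", "gas_kwh": 20000},
    {"property_age": "pre-1930", "property_type": "flat", "gas_kwh": 9000},
    {"property_age": "post-1999", "property_type": "flat", "gas_kwh": 6000},
]

table = cross_tab(households, "property_age", "property_type", "gas_kwh")
print(table)
```

Passing `stat="median"` or `stat="count"` switches the statistic shown in each cell.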
Figures provided in the latest version of the tool (June 2021) are based on data used in the June 2021 National Energy Efficiency Data-Framework (NEED) publication. More information on the development of the framework, headline results and data quality are available in the publication. There are also additional detailed tables including distributions of consumption and estimates at local authority level. The data are also available as a comma separated value (csv) file.
We identified 2 processing errors in this edition of the Domestic NEED Annual report and corrected them. The changes are small and do not affect the overall findings of the report, only the domestic energy consumption estimates. The impact of energy efficiency measures analysis remains unchanged. The revisions are summarised on the Domestic NEED Report 2021 release page.
If you have any queries or comments on these outputs please contact: energyefficiency.stats@beis.gov.uk.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Name: ZeroCostDL4Mic - pix2pix example training and test dataset
(see our Wiki for details)
Data type: Paired microscopy images (fluorescence) of Lifeact-RFP and SiR-DNA
Microscopy data type: Fluorescence microscopy
Microscope: Spinning disk confocal microscope with a 20x 0.8 NA objective
Cell type: DCIS.COM Lifeact-RFP cells
File format: .png (RGB images)
Image size: 1024x1024 (Pixel size: 634 nm)
Author(s): Guillaume Jacquemet (1, 2, 3)
Contact email: guillaume.jacquemet@abo.fi
Affiliation(s):
1) Faculty of Science and Engineering, Cell Biology, Åbo Akademi University, 20520 Turku, Finland
2) Turku Bioscience Centre, University of Turku and Åbo Akademi University, FI-20520 Turku
3) ORCID: 0000-0002-9286-920X
Associated publications: Unpublished
Funding bodies: G.J. was supported by grants awarded by the Academy of Finland, the Sigrid Juselius Foundation and Åbo Akademi University Research Foundation (CoE CellMech) and by Drug Discovery and Diagnostics strategic funding to Åbo Akademi University.
License: https://creativecommons.org/publicdomain/zero/1.0/
This data set includes the data recorded during the pre-flight tests of the RHEA Model Satellite Team competing in the 2023 Turksat Model Satellite Competition organized by Teknofest.
Some data may be missing or incorrect, either because of a malfunction, because we could not save the data, or because we could not run the test at all.
The contest specifications, which explain the task and what each field means, are available here: https://cdn.teknofest.org/media/upload/userFormUpload/T-MUY_2023_Yar%C4%B1%C5%9Fma_K%C4%B1lavuzu_HE4XU_V4hvY.pdf
paket_numarasi - Packet number. It is the sequential number assigned to each telemetry packet generated during the competition and sent to the ground station. The first packet is numbered "1" and numbering continues sequentially. If the processor restarts, numbering should continue from the last number used.
uydu_statusu - Satellite status. It is a numeric value showing the status of the model satellite during the mission. The following statuses are mandatory: 0: Ready-to-Fly (before the rocket is fired); 1: Ascension; 2: Model Satellite Landing; 3: Separation; 4: Payload Landing; 5: Recovery (payload ground contact); 6: Package Video (500 KB) Received; 7: Package Video (500 KB) Sent (bonus task).
hata_kodu - It is a 5-digit telemetry data consisting of 0 or 1 to be created according to the specified error conditions.
gonderme_saat - It is real-time clock data in the form of Day/Month/Year, Hour/Minute/Second.
basinc1 - It is the atmospheric pressure value measured by the sensor on the payload. Its unit is Pascal.
basinc2 - It is the atmospheric pressure value measured by the sensor on the carrier. Its unit is Pascal.
yukseklik1 - It is the height of the payload above the starting point of the flight, which should be set to 0 meters. Its unit is meters.
yukseklik2 - It is the height of the carrier above the starting point of the flight, which should be set to 0 meters. Its unit is meters.
irtifa_farki - It is the absolute difference between yukseklik1 and yukseklik2. Its unit is meters.
inis_hizi - Descent velocity data. Its unit is m/s
sicaklik - It is the measured temperature data. Its unit is degrees C.
pil_gerilim - Indicates the voltage of the battery. Its unit is V.
gps_latitude - It is the latitudinal position of the payload.
gps_longitude - It is the longitudinal position of the payload.
gps_altitude - It is the altitude data of the payload received from GPS.
pitch - It is the tilt angle on the pitch axis. The unit is degrees.
yaw - It is the tilt angle on the yaw axis. The unit is degrees.
roll - It is the tilt angle on the roll axis. The unit is degrees.
takim_no - Teams applying to the competition are given a team number after the application process is completed. It is a 5-digit number. The team number of each team is different from the number of other teams.
video_aktarim_bilgisi - Informs whether the camera image is recorded or not.
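The column descriptions above can be sanity-checked with a short pandas sketch. It assumes the telemetry has already been loaded into a DataFrame with the column names listed above; the helper names (`decode_error_code`, `check_altitude_difference`) and the 0.5 m tolerance are illustrative assumptions, not part of the dataset. The meaning of each digit of hata_kodu is not stated here, so the sketch only splits the code into five flags.

```python
import pandas as pd

# Status codes exactly as listed in the uydu_statusu description.
STATUS_NAMES = {
    0: "Ready-to-Fly",
    1: "Ascension",
    2: "Model Satellite Landing",
    3: "Separation",
    4: "Payload Landing",
    5: "Recovery",
    6: "Package Video Received",
    7: "Package Video Sent",
}

def decode_error_code(code):
    """Split a 5-digit hata_kodu value into five booleans (True where the digit is 1).

    Accepts an int or a string; leading zeros lost by integer parsing are restored.
    """
    digits = str(code).strip().zfill(5)
    if len(digits) != 5 or set(digits) - {"0", "1"}:
        raise ValueError(f"not a 5-digit 0/1 code: {code!r}")
    return [d == "1" for d in digits]

def check_altitude_difference(df, tolerance=0.5):
    """Return a boolean Series: True where irtifa_farki matches |yukseklik1 - yukseklik2|.

    Rows with missing values compare as False. The tolerance (in meters) is an assumption.
    """
    expected = (df["yukseklik1"] - df["yukseklik2"]).abs()
    return (df["irtifa_farki"] - expected).abs() <= tolerance

# Toy rows for illustration only; they are not taken from the dataset.
sample = pd.DataFrame({
    "uydu_statusu": [1, 3],
    "hata_kodu": ["00000", "01000"],
    "yukseklik1": [100.0, 50.0],
    "yukseklik2": [90.0, 80.0],
    "irtifa_farki": [10.0, 20.0],
})
sample["status_name"] = sample["uydu_statusu"].map(STATUS_NAMES)
sample["altitude_ok"] = check_altitude_difference(sample)
print(sample[["status_name", "altitude_ok"]])
```

Since the description notes that some data may be missing or incorrect, a check like this is one way to flag rows worth inspecting before plotting.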
In addition to the data set, you can visit this repository to visualize the data and examine the code of the ground station we used to record it. https://github.com/SHaken53/Yer_Istasyonu_06
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Excel population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of Excel. The dataset can be used to understand the population distribution of Excel by age. For example, using this dataset, we can identify the largest age group in Excel.
Key observations
The largest age group in Excel, AL was the 5 to 9 years group, with a population of 77 (15.28%), according to the ACS 2019-2023 5-Year Estimates. The smallest age group in Excel, AL was the 85 years and over group, with a population of 2 (0.40%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
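As a quick arithmetic check of the percentages quoted above, the sketch below recomputes each group's share. The total population of 504 is not stated in this description; it is an assumption inferred from the figures, since 504 is the only whole-number total for which 77 people round to 15.28%.

```python
def share_percent(group_population, total_population):
    """Percentage of the total population in one age group, rounded to two decimals."""
    return round(100 * group_population / total_population, 2)

TOTAL_POPULATION = 504  # assumption: inferred from 77 being 15.28% of the total

print(share_percent(77, TOTAL_POPULATION))  # 5 to 9 years group
print(share_percent(2, TOTAL_POPULATION))   # 85 years and over group
```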
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is part of the main dataset for Excel Population by Age.
License: ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn and make predictions or decisions without being explicitly programmed. Here are some key concepts and terms to help you get started:
Supervised Learning: In supervised learning, the machine learning algorithm learns from labeled training data. The training data consists of input examples and their corresponding correct output or target values. The algorithm learns to generalize from this data and make predictions or classify new, unseen examples.
Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. Unlike supervised learning, there are no target values provided. Instead, the algorithm aims to discover inherent structures or clusters in the data.
Training Data and Test Data: Machine learning models require a dataset to learn from. The dataset is typically split into two parts: the training data and the test data. The model learns from the training data, and the test data is used to evaluate its performance and generalization ability.
Features and Labels: In supervised learning, the input examples are often represented by features or attributes. For example, in a spam email classification task, features might include the presence of certain keywords or the length of the email. The corresponding output or target values are called labels, indicating the class or category to which the example belongs (e.g., spam or not spam).
Model Evaluation Metrics: To assess the performance of a machine learning model, various evaluation metrics are used. Common metrics include accuracy (the proportion of correctly predicted examples), precision (the proportion of true positives among all positive predictions), recall (the proportion of actual positives that are correctly predicted), and F1 score (the harmonic mean of precision and recall).
Overfitting and Underfitting: Overfitting occurs when a model becomes too complex and learns to memorize the training data instead of generalizing well to unseen examples. On the other hand, underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Balancing the complexity of the model is crucial to achieve good generalization.
Feature Engineering: Feature engineering involves selecting or creating relevant features that can help improve the performance of a machine learning model. It often requires domain knowledge and creativity to transform raw data into a suitable representation that captures the important information.
Bias and Variance Trade-off: The bias-variance trade-off is a fundamental concept in machine learning. Bias refers to the errors introduced by the model's assumptions and simplifications, while variance refers to the model's sensitivity to small fluctuations in the training data. Reducing bias may increase variance and vice versa. Finding the right balance is important for building a well-performing model.
Supervised Learning Algorithms: There are various supervised learning algorithms, including linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. Each algorithm has its own strengths, weaknesses, and specific use cases.
Unsupervised Learning Algorithms: Unsupervised learning algorithms include clustering algorithms like k-means clustering and hierarchical clustering, dimensionality reduction techniques like principal component analysis (PCA) and t-SNE, and anomaly detection algorithms, among others.
These concepts provide a starting point for understanding the basics of machine learning. As you delve deeper, you can explore more advanced topics such as deep learning, reinforcement learning, and natural language processing. Remember to practice hands-on with real-world datasets to gain practical experience and further refine your skills.
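The evaluation metrics described above can be made concrete with a small sketch in plain Python. The function name `classification_metrics` and the toy labels are illustrative assumptions; the formulas are the standard definitions for a binary task with one positive class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero when a class never occurs.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Toy example: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.75 while precision, recall and F1 are each 2/3, which shows how accuracy alone can look better than the per-class metrics when negatives dominate.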
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is aggregated from sources such as
Entirely available in the public domain.
Resumes are usually in PDF format. OCR was used to convert the PDFs into text, and LLMs were used to convert the text into a structured format.
This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.
Primary table containing core information about each individual.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Unique identifier for each person | Primary Key, Not Null | 1 |
| name | VARCHAR(255) | Full name of the person | May be Null | "Database Administrator" |
| | VARCHAR(255) | Email address | May be Null | "john.doe@email.com" |
| phone | VARCHAR(50) | Contact number | May be Null | "+1-555-0123" |
| | VARCHAR(255) | LinkedIn profile URL | May be Null | "linkedin.com/in/johndoe" |
Detailed abilities and competencies listed by individuals.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| ability | TEXT | Description of ability | Not Null | "Installation and Building Server" |
Contains educational history for each person.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| institution | VARCHAR(255) | Name of educational institution | May be Null | "Lead City University" |
| program | VARCHAR(255) | Degree or program name | May be Null | "Bachelor of Science" |
| start_date | VARCHAR(7) | Start date of education | May be Null | "07/2013" |
| location | VARCHAR(255) | Location of institution | May be Null | "Atlanta, GA" |
Details of work experience entries.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| title | VARCHAR(255) | Job title | May be Null | "Database Administrator" |
| firm | VARCHAR(255) | Company name | May be Null | "Family Private Care LLC" |
| start_date | VARCHAR(7) | Employment start date | May be Null | "04/2017" |
| end_date | VARCHAR(7) | Employment end date | May be Null | "Present" |
| location | VARCHAR(255) | Job location | May be Null | "Roswell, GA" |
Mapping table connecting people to their skills.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| skill | VARCHAR(255) | Reference to skills table | Foreign Key, Not Null | "SQL Server" |
Master list of unique skills mentioned across all resumes.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| skill | VARCHAR(255) | Unique skill name | Primary Key, Not Null | "SQL Server" |
-- Get all skills for a person
SELECT s.skill
FROM person_skills ps
JOIN skills s ON ps.skill = s.skill
WHERE ps.person_id = 1;
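-- Get complete work history, newest first
-- (start_date is text in MM/YYYY form, so sort by the year part, then the month part)
SELECT *
FROM experience
WHERE person_id = 1
ORDER BY SUBSTR(start_date, 4, 4) DESC, SUBSTR(start_date, 1, 2) DESC;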
-- Most common skills
SELECT s.skill, COUNT(*) as frequency
FROM person_skills ps
...
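Because start_date and end_date are stored as VARCHAR(7) text such as "04/2017" or "Present", sorting them as plain strings does not give chronological order. The sketch below shows one way to handle this in pandas. It assumes the experience table has been loaded into a DataFrame with the column names from the schema above; the function name `add_sortable_dates` and the added columns are hypothetical.

```python
import pandas as pd

def add_sortable_dates(experience):
    """Add parsed date columns and return rows sorted newest first.

    Values that are not in MM/YYYY form (for example "Present" or nulls)
    become NaT in the parsed columns.
    """
    out = experience.copy()
    for col in ("start_date", "end_date"):
        out[col + "_parsed"] = pd.to_datetime(out[col], format="%m/%Y", errors="coerce")
    # Flag jobs whose end_date is the literal text "Present".
    out["is_current"] = out["end_date"].eq("Present")
    return out.sort_values("start_date_parsed", ascending=False, na_position="last")

# Toy rows for illustration only; they are not taken from the dataset.
experience = pd.DataFrame({
    "person_id": [1, 1, 1],
    "title": ["A", "B", "C"],
    "start_date": ["04/2017", "11/2015", "01/2020"],
    "end_date": ["12/2019", "03/2017", "Present"],
})
history = add_sortable_dates(experience)
print(history[["title", "start_date", "end_date", "is_current"]])
```

A plain descending string sort of these toy start dates would put "11/2015" first, which is why the parsed column is used for ordering.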
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.
PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.
The document types are invoices, inventory reports, purchase orders, and shipping orders.
This dataset can be used for text classification, information extraction, and document clustering.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains histopathological image data for the identification of colon cancer using deep learning. It includes high-resolution images labeled as cancerous or non-cancerous, intended for training and validating computer vision models in medical imaging.
The dataset is organised into two main image folders and two supporting CSV files:
├── train/ # 7,560 labelled images for training
├── test/ # 5,041 unlabeled images for inference/testing
├── train.csv # Contains image filenames and corresponding labels (for train/ folder)
├── example.csv # Sample format for custom data input
| Folder/File | Description |
|---|---|
| train/ | Contains labeled histopathology images |
| test/ | Contains images without labels for model inference |
| train.csv | CSV file with two columns: image_id, label |
| example.csv | A demonstration CSV with the expected structure |
Label Encoding:
Id → The Id of the Image
Type → Cancer / Connective / Immune / Normal
Load the training labels:
import pandas as pd
df = pd.read_csv("train.csv")
print(df.head())
Read an image:
from PIL import Image
img = Image.open("train/image_00123.jpg")
img.show()
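Since only train/ comes with labels, a validation set has to be carved out of train.csv. The sketch below shows one possible stratified split in pandas. It is an assumption-laden illustration: the function name `stratified_split` is hypothetical, the label column is passed as a parameter because this description mentions both `label` and `Type`, and the DataFrame is assumed to have a unique index.

```python
import pandas as pd

def stratified_split(labels, label_col="label", val_frac=0.2, seed=0):
    """Split a labels table into train and validation parts, sampling
    the same fraction from every class so class proportions are preserved."""
    val = labels.groupby(label_col, group_keys=False).sample(
        frac=val_frac, random_state=seed
    )
    train = labels.drop(val.index)  # relies on a unique index
    return train, val

# Toy table for illustration only: 10 images for each of four classes.
toy = pd.DataFrame({
    "image_id": [f"image_{i:05d}" for i in range(40)],
    "label": ["Cancer", "Connective", "Immune", "Normal"] * 10,
})
train_part, val_part = stratified_split(toy, label_col="label", val_frac=0.2)
print(val_part["label"].value_counts())
```

With the real file, `toy` would be replaced by the DataFrame returned from `pd.read_csv("train.csv")`, and `label_col` set to whichever label column the file actually contains.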
Uploaded by: Arpan Gupta
Full project using this dataset: GitHub Repo
Notebook Using Dataset: Kaggle
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset consists of a CSV file containing 300 generated spam email messages. Each row in the file represents a separate email message, with its title and text. The dataset aims to facilitate the analysis and detection of spam emails.
The dataset can be used for various purposes, such as training machine learning algorithms to classify and filter spam emails, studying spam email patterns, or analyzing text-based features of spam messages.
The data was generated with the text-davinci-003 model through the OpenAI API.