This dataset tracks the number of days since the row count on a dataset asset has changed. Its purpose is to ensure datasets are updating as expected. This dataset is identical to the Socrata Asset Inventory, with added Checkpoint Date and Days Since Row Count Change attributes.
https://creativecommons.org/publicdomain/zero/1.0/
This data was collected as a course project for the immersive data science course (by General Assembly and Misk Academy).
This dataset is in CSV format. It consists of 5,717 rows and 15 columns, where each row is a dataset on Kaggle and each column represents a feature of that dataset.

|Feature|Description|
|-------|-----------|
|title| dataset name |
|usability| dataset usability rating by Kaggle |
|num_of_files| number of files associated with the dataset |
|types_of_files| types of files associated with the dataset |
|files_size| size of the dataset files |
|vote_counts| total vote count by dataset viewers |
|medal| reward for popular datasets, measured by the number of upvotes (votes by novices are excluded from medal calculation): Bronze = 5 votes, Silver = 20 votes, Gold = 50 votes |
|url_reference| reference to the dataset page on Kaggle in the format: www.kaggle.com/url_reference |
|keywords| topics tagged with the dataset |
|num_of_columns| number of features in the dataset |
|views| number of views |
|downloads| number of downloads |
|download_per_view| download-per-view ratio |
|date_created| dataset creation date |
|last_updated| date of the last update |
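As an illustration, a minimal sketch of loading and inspecting the file with pandas; the filename kaggle_datasets.csv is an assumption and should be replaced with the actual CSV name in this dataset:

```python
import pandas as pd

# Hypothetical filename; replace with the actual CSV from this dataset.
df = pd.read_csv("kaggle_datasets.csv")

print(df.shape)    # expected: (5717, 15)
print(df.dtypes)   # one column per feature listed in the table above
print(df["medal"].value_counts(dropna=False))  # distribution of medal awards
```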
I would like to thank all my GA instructors for their continuous help and support
All data were taken from https://www.kaggle.com, collected on 30 Jan 2021.
Using this dataset, we could try to predict attributes of newly uploaded datasets, such as the number of votes, number of downloads, or medal type.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered where the book is Count Draco down under. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------
Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

## Introduction

This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine: The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on the MEDLINE 2015 baseline and the MeSH tree from 2015. The dataset is distributed in the form of the following tab-separated text files:

* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
  - PMID: PubMed ID
  - Year: year of publication
  - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
  - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
  - PairTimeNovelty: time novelty score of the paper based on pairs of concepts (see paper)
  - PairVolumeNovelty: volume novelty score of the paper based on pairs of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
  - MeshTerm: name of the MeSH term
  - Year: year
  - AbsVal: total publications with that MeSH term in the given year
  - TimeNovelty: age (in years since first publication) of the MeSH term in the given year
  - VolumeNovelty: age (in number of papers since first publication) of the MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH pair for all years:
  - Mesh1: name of the first MeSH term (alphabetically sorted)
  - Mesh2: name of the second MeSH term (alphabetically sorted)
  - Year: year
  - AbsVal: total publications with that MeSH pair in the given year
  - TimeNovelty: age (in years since first publication) of the MeSH pair in the given year
  - VolumeNovelty: age (in number of papers since first publication) of the MeSH pair in the given year
* README.txt file

## Dataset creation

This dataset was constructed using multiple datasets described in the following locations:

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
* Source code provided at: https://github.com/napsternxg/Novelty

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information on getting PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at: Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
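For orientation, a minimal pandas sketch for reading the TSV files (a sketch only: the paper-level file is large and is read in chunks here, and meshpair_scores.txt.gz is best processed in a streaming fashion; whether the files include a header row is an assumption):

```python
import pandas as pd

# Per-paper novelty scores (~22.3M rows): read in chunks to limit memory use.
# If the files have no header row, pass header=None and names=[...] explicitly.
novelty_chunks = pd.read_csv(
    "PubMed2015_NoveltyData.tsv", sep="\t", chunksize=1_000_000
)
n_rows = sum(len(chunk) for chunk in novelty_chunks)
print(f"PubMed2015_NoveltyData.tsv rows: {n_rows}")

# Per-MeSH-term temporal profiles (~1.1M rows) fit in memory comfortably.
mesh_scores = pd.read_csv("mesh_scores.tsv", sep="\t")
print(mesh_scores.columns.tolist())
```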
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
Parquet file, with:
35,694 rows
154 columns
The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
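As a quick illustration, a minimal pandas sketch of loading the file and splitting the participant/month index (the filename features.parquet is an assumption; substitute the actual Parquet file distributed with this record):

```python
import pandas as pd

# Hypothetical filename; replace with the actual Parquet file from this record.
df = pd.read_parquet("features.parquet")
print(df.shape)  # expected: (35694, 154)

# The index is "[participant]_[month]", e.g. "34_12" = month 12 of participant 34
# (assuming the index is stored as strings in that format).
idx = df.index.to_series().str.split("_", expand=True)
df["participant_id"] = idx[0].astype(int)
df["month"] = idx[1].astype(int)
```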
File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a 1-year-long longitudinal study of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes between January 2018 and January 2020.
The data subset used in this work comprises the following:
Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows, one for each month of data collection from the participants. We can generate 3-month-long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
Context & Motivation
Electronic Health Records (EHR) are a cornerstone for modern healthcare analytics and machine-learning research, but real clinical data is sensitive, tightly regulated, and hard to share. To enable rapid prototyping, teaching, and multi-language experimentation without privacy concerns, we generated a synthetic, longitudinal EHR dataset in seven languages.

Contents & Structure
* 100 K total records (10 K demo per language)
* Simulates multi-visit patients over a 10-year span
* Includes 16 core clinical variables: demographics (ID, sex, age), vitals, diagnosis (ICD-10), treatments, comorbidities, outcomes, and relapse risk
* All values are entirely artificial, statistically coherent but containing no real patient information

Languages
English, Spanish, French, Portuguese, Arabic, Hindi, Russian: ready for international ML, NLP, or data-science pipelines.

Use Cases
* Quickly benchmark classification/regression models (risk prediction, outcome forecasting)
* Prototype dashboards or visualizations in any language
* Build multi-lingual NLP tools on synthetic clinical notes
* Educational labs, hackathons, or demos without GDPR/PHI hurdles

Generation & Quality
Data was simulated using Python (pandas, Faker) with realistic distributions for vital signs, diagnoses, and comorbidities (a simplified sketch follows below). We ensured each language version uses its local terminology and character set, so you can test encoding, tokenization, and locale-sensitive pipelines.

License & Access
This demo is released under a custom MIT-style license (see LICENSE.txt). For the full 100 K-row dataset with extended documentation and variable dictionaries, visit our Gumroad page or contact em@sianabox.com.
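The following is a minimal, hypothetical sketch of the kind of Faker/pandas generation described above. It is not the authors' actual script; the column names, value ranges, and ICD-10 subset are illustrative assumptions only:

```python
import random
import pandas as pd
from faker import Faker

def generate_patients(n: int = 100, locale: str = "es_ES", seed: int = 42) -> pd.DataFrame:
    """Generate n fully synthetic patient records in the given locale (illustrative schema)."""
    fake = Faker(locale)
    Faker.seed(seed)
    random.seed(seed)
    icd10_codes = ["E11", "I10", "J45", "F32", "M54"]  # illustrative subset of ICD-10 codes
    rows = []
    for i in range(n):
        rows.append({
            "patient_id": f"P{i:05d}",
            "name": fake.name(),                      # locale-aware synthetic name
            "sex": random.choice(["M", "F"]),
            "age": random.randint(18, 90),
            "systolic_bp": round(random.gauss(125, 15)),
            "diagnosis_icd10": random.choice(icd10_codes),
            "relapse_risk": round(random.random(), 2),
        })
    return pd.DataFrame(rows)

df = generate_patients(10)
print(df.head())
```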
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A. SUMMARY
This dataset reports the number of new residential units made available for occupancy in San Francisco since January 2018. Each row in this dataset shows the change in the number of new units associated with a building permit application. Each row also includes the date those units were approved for occupancy, the type of document approving them, and their address.
Values in the column [Number of Units Certified] can be added together to produce a count of new units approved for occupancy since January 2018.
These records provide a preliminary count of new residential units. The San Francisco Planning Department issues a Housing Inventory Report each year that provides a more complete account of new residential units, and those results may vary slightly from records in this dataset. The Housing Inventory Report is an in-depth annual research project requiring extensive work to validate information about projects. By comparison, this dataset is meant to provide more timely updates about housing production based on available administrative data. The Department of Building Inspection and Planning Department will reconcile these records with future Housing Inventory Reports.
B. METHODOLOGY
At the end of each month, DBI staff manually calculate how many new units are available for occupancy for each building permit application and enter that information into this dataset. These records reflect counts for all types of residential units, including authorized accessory dwelling units. These records do not reflect units demolished or removed from the city’s available housing stock.
Multiple records may be associated with the same building permit application number, which means that new certifications or amendments were issued. Only changes to the net number of units associated with that permit application are recorded in subsequent records.
For example, Building Permit Application Number [201601010001] located at [123 1st Avenue] was issued an [Initial TCO] Temporary Certificate of Occupancy on [January 1, 2018] approving 10 units for occupancy. Then, an [Amended TCO] was issued on [June 1, 2018] approving [5] additional units for occupancy, for a total of 15 new units associated with that Building Permit Application Number. The building will appear twice in the dataset, with each row representing when new units were approved.
If additional or amended certifications are issued for a building permit application, but they do not change the number of units associated with that building permit application, those certifications are not recorded in this dataset. For example, if all new units associated with a project are certified for occupancy under an Initial TCO, then the Certificate of Final Completion (CFC) would not appear in the dataset because the CFC would not add new units to the housing stock. See data definitions for more details.
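To illustrate the counting logic above, a hedged pandas sketch; the file name is hypothetical, and the column names follow the bracketed names used in this description, so they may differ from the exact field names in the published dataset:

```python
import pandas as pd

# Hypothetical file/column names based on the description above.
df = pd.read_csv("dbi_units_certified.csv")

# Net new units per building permit application (Initial TCO + Amended TCO + CFC rows).
per_permit = df.groupby("Building Permit Application Number")["Number of Units Certified"].sum()

# Citywide count of new units approved for occupancy since January 2018.
total_new_units = df["Number of Units Certified"].sum()
print(total_new_units)
```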
C. UPDATE FREQUENCY
This dataset is updated monthly.
D. DOCUMENT TYPES
Several documents issued near or at project completion can certify units for occupancy. They are: Initial Temporary Certificate of Occupancy (TCO), Amended TCO, and Certificate of Final Completion (CFC).
• Initial TCO is a document that allows for occupancy of a unit before final project completion is certified, conditional on the unit being safe to occupy. The TCO is meant to be temporary and has an expiration date. This field represents the number of units certified for occupancy when the TCO is issued.
• Amended TCO is a document that is issued when the conditions of the project change before final project completion is certified. These records show additional new units that have become habitable since the issuance of the Initial TCO.
• Certificate of Final Completion (CFC) is a document that is issued when all work is completed according to approved plans and the building is ready for complete occupancy. These records show additional new units that were not accounted for in the Initial or Amended TCOs.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Tissue Cell Raw P Dataset
This dataset was processed using the parq2hug tool on 2025-06-15.
Dataset Information
Rows: 20,116
Columns: 195
File Size: 25.91 MB
File Structure
expression.parquet
expression.parquet contains the main dataset with 20,116 rows and 195 columns.
feature_metadata.parquet
feature_metadata.parquet contains metadata for each feature (column) in the dataset, including:
Feature name Data type Statistics (count, mean… See the full description on the dataset page: https://huggingface.co/datasets/longevity-db/tissue_cell_raw_p_dataset_new.
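A minimal sketch of loading the two Parquet files with pandas after downloading them from the dataset repository via huggingface_hub (assuming the file names match those listed above):

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the two Parquet files from the dataset repository
# (file names assumed to match those listed above).
repo = "longevity-db/tissue_cell_raw_p_dataset_new"
expr_path = hf_hub_download(repo_id=repo, filename="expression.parquet", repo_type="dataset")
meta_path = hf_hub_download(repo_id=repo, filename="feature_metadata.parquet", repo_type="dataset")

expression = pd.read_parquet(expr_path)        # expected shape: (20116, 195)
feature_metadata = pd.read_parquet(meta_path)  # per-column metadata
print(expression.shape, feature_metadata.shape)
```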
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This dataset includes Point-in-Time (PIT) data collected in Cambridge between 2012 and 2024. The PIT count is a count of sheltered and unsheltered homeless persons on a single night in January. The U.S. Department of Housing and Urban Development (HUD) requires that communities receiving funding through the Continuum of Care (CoC) Program conduct an annual count of homeless persons on a single night in the last 10 days of January, and these data contribute to national estimates of homelessness reported in the Annual Homeless Assessment Report to the U.S. Congress. This dataset is comprised of data submitted to, and stored in, HUD’s Homelessness Data Exchange (HDX).
This dataset includes basic counts and demographic information of persons experiencing homelessness on each PIT date from 2012-2024. The dataset contains four rows for each year, including one row for each housing type: Emergency Shelter, Transitional Housing, or Unsheltered. The dataset also includes housing inventory counts of the number of shelter and transitional housing units available on each of the PIT count dates.
Information about persons staying in emergency shelters and transitional housing units is exported from the Homeless Management Information System (HMIS), which is the primary database for recording client-level service records. Information about persons in unsheltered situations is compiled by first conducting an overnight street count of persons observed sleeping outdoors on the PIT night to establish the total number of unsheltered persons. Demographic information for unsheltered persons is then extrapolated by utilizing assessment data collected by street outreach workers during the 7 days following the PIT count.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, each saved as a single R object.
These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.
We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.
We then subset the original DoCA dataset to include only rows that match a recollected article. The "20231006_dca_metadata_subset.Rds" contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" and "pub_title", the latter being the title of the recollected article (which may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event and one article may cover more than one protest event).
Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: We removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy" into "John Kennedy"); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin" ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's" into "it is"); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.
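A hedged Python sketch of the kind of cleaning pipeline described above (this is not the authors' original code, which operated on the full OCRed corpus; the contraction list here is an illustrative subset):

```python
import re
import unicodedata

CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}  # illustrative subset

def preprocess(text: str) -> str:
    # Split words fused by OCR: add a space when an uppercase letter follows a lowercase one.
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    # Transliterate to basic ASCII (drops curly quotes and other non-Latin characters).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace contractions.
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    # Remove punctuation and numbers, lowercase, collapse whitespace.
    text = re.sub(r"[^\sA-Za-z]", " ", text).lower()
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("JohnKennedy said it's over in 1963."))
# -> "john kennedy said it is over in"
```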
We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is then a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred fewer than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens, and 17,493 documents. The "20231006_dca_dtm.Rds" is a sparse matrix class object from the Matrix R package.
In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate `dca_meta` with `dca_dtm`, match the "pdf_file" variable in `dca_meta` to the rownames of `dca_dtm`.
As part of NASA's Making Earth System Data Records for Use in Research Environments (MEaSUREs) program, this project, entitled “Multi-Decadal Nitrogen Dioxide and Derived Products from Satellites (MINDS)”, will develop consistent long-term global trend-quality data records spanning the last two decades, over which remarkable changes in nitrogen oxides (NOx) emissions have occurred. The objective of the project is to adapt Ozone Monitoring Instrument (OMI) operational algorithms to other satellite instruments and create consistent multi-satellite L2 and L3 nitrogen dioxide (NO2) columns and value-added L4 surface NO2 concentrations and NOx emissions data products, systematically accounting for instrumental differences. The instruments include the Global Ozone Monitoring Experiment (GOME, 1996-2011), the SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY (SCIAMACHY, 2002-2012), OMI (2004-present), GOME-2 (2007-present), and the TROPOspheric Monitoring Instrument (TROPOMI, 2018-present). The quality-assured L2-L4 products will be made available to the scientific community via the NASA GES DISC website in Climate and Forecast (CF)-compliant Hierarchical Data Format (HDF5) and netCDF formats.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about universities in Chile. It has 16 rows. It features 2 columns including total students.
This dataset features over 4,000 customer reviews of Dermalogica cleansing exfoliators, all sourced from Ulta.com. It was compiled on 27 March 2023 using Python libraries, specifically designed for Natural Language Processing (NLP) tasks. The dataset provides valuable insights into customer opinions and product performance, making it ideal for various analytical applications.
The dataset contains over 4,000 individual reviews. Data was scraped on 27 March 2023. For the 'Verified_Buyer' column, there are 1,249 (30%) 'true' entries and 2,901 (70%) 'false' entries. Key product mentions are evenly split between 'Daily Superfoliant' and 'Daily Microfoliant', each at 36%. Review dates are categorised into "2 years ago" (22%), "1 year ago" (20%), and other periods (58%). Reviewer locations are largely "Undisclosed" (22%) or other (75%), with a small percentage (3%) from "Los Angeles". Specific numbers for rows or records beyond the total review count are not explicitly detailed, but metrics for upvotes and downvotes are available.
This dataset is particularly useful for:
* Sentiment Analysis: Determining the overall positive or negative sentiments associated with each Dermalogica product (see the sketch after this list).
* Text Analysis: Extracting insights from review texts, such as common skincare concerns addressed by the products or issues they helped resolve or worsen.
* Inferential Statistics: Analysing statistically significant differences in average sentiment scores across different product reviews.
* Data Visualisation: Creating visual representations like bar plots or word clouds to highlight frequently used words or phrases in relation to specific products.
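For example, a minimal sentiment-scoring sketch with NLTK's VADER analyzer; the file name and the Review_Text and Product column names are assumptions about this dataset's layout and should be adjusted to the actual schema:

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

# Hypothetical file/column names; adjust to the actual CSV schema.
reviews = pd.read_csv("ulta_dermalogica_reviews.csv")
sia = SentimentIntensityAnalyzer()
reviews["sentiment"] = reviews["Review_Text"].astype(str).map(
    lambda text: sia.polarity_scores(text)["compound"]
)

# Average sentiment per product, e.g. Daily Superfoliant vs. Daily Microfoliant.
print(reviews.groupby("Product")["sentiment"].mean())
```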
The data encompasses customer reviews of Dermalogica cleansing exfoliators published on Ulta.com. Geographically, while many reviewer locations are undisclosed, some specific cities like Los Angeles are noted, and the dataset is broadly considered to have a global reach. The time range of the reviews extends back from the data scrape date of 27 March 2023, with reviews published up to two years prior. No specific demographic breakdown is provided, though the 'Verified_Buyer' flag offers a binary indication of purchase confirmation.
CC-BY
This dataset is beneficial for a range of professionals and organisations, including:
* Data Scientists and NLP Engineers: For developing and testing natural language processing models.
* Market Researchers: To understand customer feedback, identify market trends, and assess product performance within the skincare industry.
* Skincare Brands: For gaining insights into customer satisfaction, identifying product strengths and weaknesses, and informing product development strategies.
* Academics and Students: For research projects focused on consumer behaviour, text analytics, or machine learning applications in e-commerce.
Original Data Source: NLP: Ulta Skincare Reviews
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: Communication and mutual trust are key drivers of effective teamwork in human teams. In human-AI teams, i.e. teams composed of both humans and artificial agents, communication and trust are also important. In this research project, we investigated how different forms of artificial agent communication affect human trust and satisfaction in such teams. Participants teamed up with artificial agents in an online setting (using a 2D grid world) and their decisions were logged. This dataset includes different metrics calculated from the logs, self-reported questionnaire answers on trust and satisfaction, and free answers to open questions.
This dataset was created during the Research Project course of the Computer Science Bachelor's programme at Delft University of Technology, supervised by Carolina Jorge and Dr. Myrthe Tielman. Five students ran a user study with six different conditions (the baseline and five new conditions, one developed by each student). The full description of the user study and their individual results (i.e., pairwise comparisons between their own condition and the baseline) can be found in each of their theses, linked on this page below.
Then, a full joint dataset was created; it can be found in "Full dataset.csv" (140 rows in total). To balance the number of participants per condition, we generated "capped_dataset.csv" with 20 rows per condition (total N=120). We analysed differences among conditions and reran the pairwise comparisons on "capped_dataset.csv". The code can be found in "Quantitative Analysis.ipynb". These results are to be published in a paper; the author contributions can be found in "author_contribution.txt".
The full code used for the generation of this dataset can be found in this Github repository: https://github.com/centeio/AT-Communication
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
-------------------------------------------------------------------------------------------------------------
CITATION
-------------------------------------------------------------------------------------------------------------
Please cite this data and code as:
H. Khamis, R. Weiss, Y. Xie, C-W. Chang, N. H. Lovell, S. J. Redmond, "QRS detection algorithm for telehealth electrocardiogram recordings," IEEE Transactions on Biomedical Engineering, vol. 63(7), p. 1377-1388, 2016.

-------------------------------------------------------------------------------------------------------------
DATABASE DESCRIPTION
-------------------------------------------------------------------------------------------------------------
The following description of the TELE database is from Khamis et al. (2016):

"In Redmond et al. (2012), 300 ECG single lead-I signals recorded in a telehealth environment are described. The data was recorded using the TeleMedCare Health Monitor (TeleMedCare Pty. Ltd., Sydney, Australia). This ECG is sampled at a rate of 500 Hz using dry metal Ag/AgCl plate electrodes which the patient holds with each hand; a reference electrode plate is also positioned under the pad of the right hand. Of the 300 recordings, 250 were selected randomly from 120 patients, and the remaining 50 were manually selected from 168 patients to obtain a larger representation of poor quality data. Three independent scorers annotated the data by identifying sections of artifact and QRS complexes. All scorers then annotated the signals as a group, to reconcile the individual annotations. Sections of the ECG signal which were less than 5 s in duration were considered to be part of the neighboring artifact sections and were subsequently masked. QRS annotations in the masked regions were discarded prior to the artifact mask and QRS locations being saved. Of the 300 telehealth ECG records in Redmond et al. (2012), 50 records (including 29 of the 250 randomly selected records and 21 of the 50 manually selected records) were discarded as all annotated RR intervals within these records overlap with the annotated artifact mask and therefore no heart rate can be calculated, which is required for measuring algorithm performance. The remaining 250 records will be referred to as the TELE database."

For all 250 recordings in the TELE database, the mains frequency was 50 Hz, the sampling frequency was 500 Hz, and the top and bottom rail voltages were 5.556912223578890 mV and -5.554198887532222 mV respectively.

-------------------------------------------------------------------------------------------------------------
DATA FILE DESCRIPTION
-------------------------------------------------------------------------------------------------------------
Each record in the TELE database is stored as an X_Y.dat file, where X indicates the index of the record in the TELE database (containing a total of 250 records) and Y indicates the index of the record in the original dataset containing 300 records (see Redmond et al. 2012).

The .dat file is a comma-separated values file. Each line contains:
- the ECG sample value (mV)
- a boolean indicating the locations of the annotated QRS complexes
- a boolean indicating the visually determined mask
- a boolean indicating the software determined mask (see Khamis et al. 2016)

-------------------------------------------------------------------------------------------------------------
CONVERTING DATA TO MATLAB STRUCTURE
-------------------------------------------------------------------------------------------------------------
A matlab function (readFromCSV_TELE.m) has been provided to read the .dat files into a matlab structure:

%%
% [DB,fm,fs,rail_mv] = readFromCSV_TELE(DATA_PATH)
%
% Extracts the data for each of the 250 telehealth ECG records of the TELE database [1]
% and returns a structure containing all data, annotations and masks.
%
% IN:  DATA_PATH - String. The path containing the .hdr and .dat files
%
% OUT: DB - 1xM Structure. Contains the extracted data from the M (250) data files.
%           The structure has fields:
%           * data_orig_ind - 1x1 double. The index of the data file in the original dataset of 300 records (see [1]) - for tracking purposes.
%           * ecg_mv - 1xN double. The ecg samples (mV). N is the number of samples for the data file.
%           * qrs_annotations - 1xN double. The qrs complexes - value of 1 where a qrs is located and 0 otherwise.
%           * visual_mask - 1xN double. The visually determined artifact mask - value of 1 where the data is masked and 0 otherwise.
%           * software_mask - 1xN double. The software artifact mask - value of 1 where the data is masked and 0 otherwise.
%      fm - 1x1 double. The mains frequency (Hz)
%      fs - 1x1 double. The sampling frequency (Hz)
%      rail_mv - 1x2 double. The bottom and top rail voltages (mV)
%
% If you use this code or data, please cite as follows:
%
% [1] H. Khamis, R. Weiss, Y. Xie, C-W. Chang, N. H. Lovell, S. J. Redmond,
% "QRS detection algorithm...
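For users outside MATLAB, a hedged Python equivalent for reading one record; the file name is illustrative, and the column order follows the .dat layout described above:

```python
import numpy as np

# Illustrative file name: record 1 of the TELE database, record 10 of the original 300.
data = np.loadtxt("1_10.dat", delimiter=",")

ecg_mv = data[:, 0]                       # ECG samples in mV
qrs = data[:, 1].astype(bool)             # annotated QRS complex locations
visual_mask = data[:, 2].astype(bool)     # visually determined artifact mask
software_mask = data[:, 3].astype(bool)   # software-determined artifact mask

fs = 500  # Hz, per the database description
qrs_times_s = np.flatnonzero(qrs) / fs
print(f"{qrs.sum()} annotated QRS complexes in {len(ecg_mv) / fs:.1f} s of ECG")
```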
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper.

The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

The unit of observation, or a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j,k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows.

The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

Two attachments:
- Word file with variables description
- Rdata file with the data set (for R language)

Appendix 1. The SET questionnaire used for this paper.

Evaluation survey of the teaching staff of [university name]

Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.

1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
See "About" for field info. This dataset tracks the speed the city responds to public records requests.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS certificate fields, and GeoIP information for 432,572 verified benign domains from Cisco Umbrella and 36,993 verified phishing domains from PhishTank and OpenPhish services. The dataset is useful for statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing detection. The data was collected between March and July 2023. The final assessment of the data was conducted in July 2023 (this is why the names are suffixed with _2307).
The upload contains: a) data files, b) the description of the data structure, and c) the feature vector we used for ML-based phishing domain detection.
The data is located in two individual files:
Both files are in the JSON Array format. The structure is as follows:
[
{
"_id" : "A unique ID of the data record",
"domain_name" : "Name of the domain (e.g., zenodo.com)",
"dns" : { "//": "Data obtained from DNS records" },
"evaluated_on" : "// ISO Timestamp of data collection ",
"ip_data" : [ "// Data for each related IP adddress ",
{
"//": "IP-related data, including RTT from ICMP echo attempts (from Brno, Czechia)",
"//": "WHOIS/RDAP data for the given IP address",
"//": "GeoIP data for the given IP address",
"//": "NERD system reputation score (if available)",
"//": "ASN info",
"//": "remarks: ISO timestamps of collection of the individual data pieces"
},
],
"label" : "benign_2307 for benign OR misp_2307 for phishing",
"rdap" : { "//": "WHOIS/RDAP information for the domain name" },
"remarks" : {
"dns_evaluated_on" : "ISO Timestamp of DNS data collection",
"rdap_evaluated_on" : "ISO Timestamp of WHOIS/RDAP data collection",
"tls_evaluated_on" : "ISO Timestamp of TLS certificate information collection",
"dns_had_no_ips" : "true if no IPs were found in DNS records"
},
"sourced_on" : "ISO Timestamp of the moment the domain was found",
"tls" : {
"cipher" : "Identifier of the TLS cipher suite",
"count" : "Number of certificates in chain",
"protocol" : "Version of the TLS protocol",
"certificates" : [
"//": "Information from TLS certificate fields: issuer, extensions, etc."
]
},
"category" : "Category of the record (could be ignored)",
"source" : "Name of the file that we used to save the domain list"
}
]
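A minimal sketch of loading one of the JSON files and iterating over the records; the file name benign_2307.json is an assumption, so use the actual file names from this upload:

```python
import json

# Hypothetical file name; the upload contains two JSON Array files (benign and phishing).
with open("benign_2307.json", encoding="utf-8") as f:
    records = json.load(f)  # a list of domain records with the structure shown above

print(len(records))
for record in records[:3]:
    domain = record["domain_name"]
    label = record["label"]               # "benign_2307" or "misp_2307"
    n_ips = len(record.get("ip_data") or [])
    print(domain, label, n_ips)
```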
This section describes the feature vector used in the "Unmasking the Phishermen: Phishing Domain Detection with Machine Learning and Multi-Source Intelligence" paper that was accepted to the IEEE NOMS 2024 conference.
The following features were extracted from the sole domain name:
The following features were extracted from DNS responses when querying about the domain:
These features were derived from IP addresses and ICMP echo replies:
The following features were extracted from TLS certificate chains and TLS handshakes:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A 10,000-patient database that contains in total 10,000 virtual patients, 36,143 admissions, and 10,726,505 lab observations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Count the ways : the greatest love stories of our time. It features 7 columns including author, publication date, language, and book publisher.
This dataset tracks the number of days since the row count on a dataset asset has changed. Its purpose is to ensure datasets are updating as expected. This dataset is identical to the Socrata Asset Inventory, with added Checkpoint Date and Days Since Row Count Change attributes.