Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System"
Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. It includes educational performance data (student grades), economic statistics (World Bank GDP), and a Python implementation of the AMIS algorithm with a graphical interface.
Contents:
- Source data: educational grades and GDP statistics
- AMIS normalization results (3, 5, 9, 17-point models)
- Comparative analysis with linear normalization
- Ready-to-use Python code for data processing
Applications:
- Educational data normalization and analysis
- Economic indicators comparison
- Development of unified metric systems
- Methodology research in data scaling
Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.
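The AMIS algorithm itself ships with the dataset; purely as a point of reference for the comparative analysis with linear normalization mentioned above, a minimal sketch of a linear (min-max) mapping onto an N-point scale might look as follows (the function name and example values are illustrative, not part of the published code):
import numpy as np
def linear_scale(values, points=5):
    # Min-max normalize values and map them linearly onto a `points`-level scale (1..points)
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    normalized = (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)
    return np.round(normalized * (points - 1)).astype(int) + 1
grades = [2, 3, 3, 4, 5, 5]            # e.g. raw student grades
print(linear_scale(grades, points=5))  # linear baseline on a 5-point model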
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset; it was generated with Python's Faker library for the sole purpose of learning.
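A minimal sketch of how rows with this schema could be generated with Faker (the locale, category lists, and value ranges below are illustrative assumptions, not the exact script used to build the dataset):
import random
from faker import Faker
fake = Faker("en_IN")  # assumed locale for Indian cities / INR amounts
def make_transaction():
    gross = round(random.uniform(100, 5000), 2)
    discount = random.choice([0, 50, 100, 250])
    return {
        "CID": fake.uuid4(),
        "TID": fake.uuid4(),
        "Gender": random.choice(["Male", "Female"]),
        "Age Group": random.choice(["18-25", "26-35", "36-45", "46-60"]),
        "Purchase Date": fake.date_time_between(start_date="-1y", end_date="now"),
        "Product Category": random.choice(["Electronics", "Apparel", "Groceries"]),
        "Discount Availed": "Yes" if discount else "No",
        "Discount Name": "FESTIVE50" if discount else None,
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": gross - discount,
        "Purchase Method": random.choice(["Credit Card", "Debit Card", "UPI"]),
        "Location": fake.city(),
    }
rows = [make_transaction() for _ in range(5)]  # scale up to 55,000 for a full synthetic table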
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This updated version includes a Python script (glucose_analysis.py) that performs statistical evaluation of the glucose normalization process described in the associated thesis. The script supports key analyses, including normality assessment (Shapiro–Wilk test), variance homogeneity (Levene’s test), mean comparison (ANOVA), effect size estimation (Cohen’s d), and calculation of confidence intervals for the mean difference. These results validate the impact of Min-Max normalization on clinical data structure and usability within CDSS workflows. The script is designed to be reproducible and complements the processed dataset already included in this repository.
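The analyses listed above can be reproduced with standard SciPy functions; a minimal sketch (array contents and group sizes are placeholders, not the thesis data) could look like this:
import numpy as np
from scipy import stats
raw = np.random.default_rng(0).normal(110, 25, 200)    # placeholder glucose values
norm = (raw - raw.min()) / (raw.max() - raw.min())      # Min-Max normalization
print(stats.shapiro(raw))          # normality assessment (Shapiro-Wilk)
print(stats.levene(raw, norm))     # variance homogeneity (Levene's test)
print(stats.f_oneway(raw, norm))   # mean comparison (one-way ANOVA)
# Cohen's d and a 95% confidence interval for the mean difference
diff = raw.mean() - norm.mean()
pooled_sd = np.sqrt((raw.var(ddof=1) + norm.var(ddof=1)) / 2)
print("Cohen's d:", diff / pooled_sd)
se = np.sqrt(raw.var(ddof=1) / raw.size + norm.var(ddof=1) / norm.size)
print("95% CI:", stats.t.interval(0.95, df=raw.size + norm.size - 2, loc=diff, scale=se))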
Additional file 8: Supplementary Table 6. MeSH term numbers in each category correctly identified by pyMeSHSim, DNorm, TaggerOne and Nelson’s manual work.
Face recognition is a popular computer vision application that allows machines to identify and verify human faces from images or videos. Python is a widely used programming language for implementing face recognition systems due to its simplicity, flexibility, and availability of powerful libraries such as OpenCV, Dlib, and TensorFlow.
Here's a professional description of a face recognition project in Python:
Dataset collection: Collect a dataset of facial images to train the model. This can be done using publicly available datasets such as LFW, CelebA, or private data.
Preprocessing: Preprocess the dataset to improve model accuracy. This includes face detection, alignment, and normalization.
Feature extraction: Extract features from the preprocessed facial images using a pre-trained deep neural network such as VGG or ResNet. This will transform each face image into a feature vector that represents the unique characteristics of the face.
Training: Train a machine learning model such as a support vector machine (SVM) or a neural network using the extracted features and corresponding labels. The model should be optimized to minimize false positives and false negatives.
Testing: Evaluate the trained model on a test dataset to measure its performance. This can be done using metrics such as accuracy, precision, and recall.
Deployment: Deploy the model to a production environment where it can be used to recognize faces in real-time. This can be done using a web-based interface or a standalone application.
Improvements: Continuously improve the model by adding new data, refining the preprocessing steps, and tuning the model hyperparameters.
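To make the feature-extraction and training steps concrete, here is a minimal sketch that assumes face embeddings have already been extracted into a NumPy array (the embedding dimension, labels, and SVM settings are illustrative placeholders):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 128))    # placeholder: 200 face embeddings from a CNN backbone
y = rng.integers(0, 5, size=200)   # placeholder identity labels for 5 people
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="linear", probability=True)  # SVM classifier trained on the embeddings
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("recall:", recall_score(y_test, pred, average="macro", zero_division=0))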
Some additional advanced techniques that can be used to improve face recognition include:
Face recognition with deep learning: Use deep learning techniques such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to train more accurate models.
Multi-face recognition: Train models to recognize multiple faces in an image or video stream.
Face recognition with privacy protection: Incorporate privacy protection techniques such as blurring or anonymization of facial features to protect personal information.
Overall, a face recognition project in Python involves collecting and preprocessing data, extracting features, training and evaluating machine learning models, deploying the model in a production environment, and continuously improving the accuracy and efficiency of the system.
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:
mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is riskier (or less risky), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.
The analysis is divided into two parts:
Data Wrangling
Exploratory Data Analysis
Descriptive statistics
Groupby
Analysis of variance
Correlation
Correlation stats
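As a minimal sketch of the data wrangling and descriptive-statistics steps above, the raw file can be loaded straight from the UCI link (the file has no header row, and missing values are encoded as "?"):
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None, na_values="?")   # raw file ships without column names
print(df.shape)                            # rows x columns of the raw data
print(df.describe())                       # descriptive statistics for numeric columns
print(df.select_dtypes("number").corr())   # correlation between numeric attributes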
Acknowledgment: UCI Machine Learning Repository. Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Additional file 4: Supplementary Table 2. GWAS phenotypes parsed by Nelson’s group and pyMeSHSim, and the semantic similarity between them calculated by pyMeSHSim and meshes.
Additional file 10: Supplementary Table 8. MeSH terms perfectly recognized by DNorm or TaggerOne but missed by pyMeSHSim, with the semantic similarity between them calculated by pyMeSHSim. pyMeSHSim_Score is the semantic similarity between Nelson_MeSH_ID and pyMeSHSim_MeSH_ID, taggerOne_score is the semantic similarity between Nelson_MeSH_ID and TaggerOne_MeSH_ID, and DNorm_score is the semantic similarity between Nelson_MeSH_ID and Dnorm_MeSH_ID.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a professions gazetteer generated with automatically extracted terminology from the Mesinesp2 corpus, a manually annotated corpus in which domain experts have labeled a set of scientific literature, clinical trials, and patent abstracts, as well as clinical case reports.
A silver gazetteer for mention classification and normalization was created by combining the predictions of automatic Named Entity Recognition models with normalization via Entity Linking to three controlled vocabularies: SNOMED CT, NCBI and ESCO. The sources are 265,025 different documents, of which 249,538 correspond to the MESINESP2 Corpora and 15,487 to clinical cases from open clinical journals. From them, 5,682,000 mentions are extracted, and 4,909,966 (86.42%) are normalized to at least one of the ontologies: SNOMED CT (4,909,966) for diseases, symptoms, drugs, locations, occupations, procedures and species; ESCO (215,140) for occupations; and NCBI (1,469,256) for species.
The repository contains a .tsv file with the following columns:
filenameid: A unique identifier combining the file name and mention span within the text. This ensures each extracted mention is uniquely traceable. Example: biblio-1000005#239#256 refers to a mention spanning characters 239–256 in the file with the name biblio-1000005.
span: The specific text span (mention) extracted from the document, representing a term or phrase identified in the dataset. Example: centro oncológico.
source: The origin of the document, indicating the corpus from which the mention was extracted. Possible values: mesinesp2, clinical_cases.
filename: The name of the file from which the mention was extracted. Example: biblio-1000005.
mention_class: Categories or semantic tags assigned to the mention, describing its type or context in the text. Example: ['ENFERMEDAD', 'SINTOMA'].
codes_esco: The normalized ontology codes from the European Skills, Competences, Qualifications, and Occupations (ESCO) vocabulary for the identified mention (if applicable). This field may be empty if no ESCO mapping exists. Example: 30629002.
terms_esco: The human-readable terms from the ESCO ontology corresponding to the codes_esco. Example: ['responsable de recursos', 'director de recursos', 'directora de recursos'].
codes_ncbi: The normalized ontology codes from the NCBI Taxonomy vocabulary for species (if applicable). This field may be empty if no NCBI mapping exists.
terms_ncbi: The human-readable terms from the NCBI Taxonomy vocabulary corresponding to the codes_ncbi. Example: ['Lacandoniaceae', 'Pandanaceae R.Br., 1810', 'Pandanaceae', 'Familia'].
codes_sct: The normalized ontology codes from SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) vocabulary for diseases, symptoms, drugs, locations, occupations, procedures, and species (if applicable). Example: 22232009.
terms_sct: The human-readable terms from the SNOMED CT ontology corresponding to the codes_sct. Example: ['adjudicador de regulaciones del seguro nacional'].
sct_sem_tag: The semantic category tag assigned by SNOMED CT to describe the general classification of the mention. Example: environment.
Suggestion: If you load the dataset using Python, it is recommended to read the columns containing lists as follows (the file name below is illustrative; use the .tsv file included in the repository):
import ast
import pandas as pd
df = pd.read_csv("mentions_gazetteer.tsv", sep="\t")
df["mention_class"] = df["mention_class"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
License
This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). This means you are free to:
Share: Copy and redistribute the material in any medium or format.
Adapt: Remix, transform, and build upon the material for any purpose, even commercially.
Attribution Requirement: Please credit the dataset creators appropriately, provide a link to the license, and indicate if changes were made.
Contact
If you have any questions or suggestions, please contact us at:
Martin Krallinger ()
Additional resources and corpora
If you are interested, you might want to check out these corpora and resources:
MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts, different document collection)
MEDDOPROF corpus
Codes Reference List (for MEDDOPROF-NORM)
Annotation Guidelines
Occupations Gazetteer
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Abstract: The dataset contains measurements of magnetic susceptibility in dependence of temperature of shocked magnetite and of a natural magnetite single crystal before and after manual crushing. A Python code for evaluation of low-temperature susceptibility curves is included. The data are supplementary to: Fuchs, H., Kontny, A. and Schilling, F.R., 2024. Stress-induced Changes in Magnetite: Insights from a Numerical Analysis of the Verwey Transition, Geophysical Journal International.
Technical remarks: The data set contains k-T curves of
- initial magnetite ore from Sydvaranger mine (Norway),
- the same ore after shock at 3, 5, 10, 20 and 30 GPa under laboratory conditions and after subsequent heating to 973 K,
- a natural magnetite single crystal (initial and after manual crushing).
The data set also contains a Python code for evaluation of normalized low-temperature k-T curves. Experimental conditions are described in [1]. The approach for k-T curve evaluation is described in [2].
[1] Kontny, A., Reznik, B., Boubnov, A., Göttlicher, J. and Steininger, R., 2018. Postshock Thermally Induced Transformations in Experimentally Shocked Magnetite, Geochemistry, Geophysics, Geosystems, Vol. 19, 3, pp. 921–931, doi:10.1002/2017GC007331.
[2] Fuchs, H., Kontny, A. and Schilling, F.R., 2024. Stress-induced Changes in Magnetite: Insights from a Numerical Analysis of the Verwey Transition, Geophysical Journal International.
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
This dataset investigates the relationship between Wordle answers and Google search spikes, particularly for uncommon words. It spans from June 21, 2021 to June 24, 2025.
It includes daily data for each Wordle answer, its search trend on that day, and frequency-based commonality indicators.
Each Wordle answer tends to cause a spike in search volume on the day it appears, especially when the word is rare.
This dataset supports exploration of how word rarity relates to same-day search interest.
| Column | Description |
|---|---|
| date | Date of the Wordle puzzle |
| word | Correct 5-letter Wordle answer |
| game | Wordle game number |
| wordfreq_commonality | Normalized frequency score using Python’s wordfreq library |
| subtlex_commonality | Normalized frequency score using the SUBTLEX-US dataset |
| trend_day_global | Google search interest on the day (global, all categories) |
| trend_avg_200_global | 200-day average search interest (global, all categories) |
| trend_day_language | Search interest on Wordle day (Language Resources category) |
| trend_avg_200_language | 200-day average search interest (Language Resources category) |
Notes:
- All trend values are relative (0–100 scale, per Google Trends)
Word commonality scores come from the wordfreq Python library, and search trends were collected with pytrends. Analysis done using this data can be found in the blog post.
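A minimal sketch of the kind of analysis these columns support, comparing the day-of-puzzle spike against the 200-day baseline (the CSV file name below is a placeholder):
import pandas as pd
df = pd.read_csv("wordle_trends.csv", parse_dates=["date"])  # placeholder file name
# Ratio of day-of-puzzle search interest to the 200-day average
df["spike_ratio"] = df["trend_day_global"] / df["trend_avg_200_global"]
# Do rarer words (lower commonality) spike harder?
print(df[["spike_ratio", "wordfreq_commonality", "subtlex_commonality"]].corr())
print(df.nlargest(10, "spike_ratio")[["date", "word", "spike_ratio"]])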
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains R and Jupyter code and data for performing statistical analyses and creating NMDS and heatmap figures for the EcoFAB 2.0 Ring Trial.
- "two-way-anova-ecofab-ringtrial-stats.ipynb" generates the statistical analysis utilizing RawData from the paper.
- "Heat_KEGG.r" generates the comparative genomics heat map utilizing normalized KEGG pathway gene abundance in the "Heat_KEGG.xlsx" dataset.
- "Heat_TM.r" generates the heat map plot for targeted metabolomics utilizing normalized metabolite intensity in the "Heat_TM.xlsx" dataset.
- "NMDS_UM.r" generates NMDS plots for untargeted metabolomics utilizing raw peak heights for detected features in the "NMDS_UM.xlsx" and "NMDS_UM1.xlsx" datasets.
- "NMDS_seq.r" generates NMDS plots for root and media microbiome composition utilizing relative bacterial abundances from 16S rRNA sequencing in the "Seq_media.xlsx" and "Seq_root" datasets.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a step-by-step pipeline for preprocessing metabolomics data.
The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.
Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.
Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.
Includes data visualization techniques to interpret PCA results effectively.
Suitable for metabolomics researchers and data scientists working on omics data.
Enables better reproducibility of preprocessing workflows for metabolomics studies.
Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.
Provides a Python-based notebook that is easy to adapt to new datasets.
Includes example datasets and code snippets for immediate application.
Helps users understand the impact of normalization on downstream statistical analyses.
Supports integration with other metabolomics pipelines or machine learning workflows.
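For orientation, a minimal sketch of PQN followed by PCA on a samples x features intensity matrix, under the usual PQN formulation (median reference spectrum, median quotient per sample); the notebook in this dataset may differ in details:
import numpy as np
from sklearn.decomposition import PCA
def pqn_normalize(X):
    # Probabilistic Quotient Normalization of a samples x features matrix
    X = np.asarray(X, dtype=float)
    reference = np.median(X, axis=0)         # reference spectrum (median across samples)
    quotients = X / reference                # per-feature quotients against the reference
    dilution = np.median(quotients, axis=1)  # per-sample dilution factor
    return X / dilution[:, None]             # divide each sample by its dilution factor
X = np.abs(np.random.default_rng(1).normal(1.0, 0.2, size=(30, 200)))  # placeholder intensities
X_pqn = pqn_normalize(X)
scores = PCA(n_components=2).fit_transform(np.log1p(X_pqn))  # PCA for exploratory analysis
print(scores[:5])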
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Task scheduler performance survey
This dataset contains the results of a task graph scheduler performance survey.
The results are stored in the following files, which correspond to simulations performed on
the elementary, irw and pegasus task graph datasets published at https://doi.org/10.5281/zenodo.2630384.
elementary-result.zip
irw-result.zip
pegasus-result.zip
The files contain compressed pandas dataframes in CSV format; they can be read with the following Python code:
import pandas as pd
frame = pd.read_csv("elementary-result.zip")
Each row in the frame corresponds to a single instance of a task graph that was simulated with a specific configuration (network model, scheduler etc.). The list below summarizes the meaning of the individual columns.
graph_name - name of the benchmarked task graph
graph_set - name of the task graph dataset from which the graph originates
graph_id - unique ID of the graph
cluster_name - type of cluster used in this instance; the format is <workers>x<cores>, e.g. 32x16 means 32 workers, each with 16 cores
bandwidth - network bandwidth [MiB/s]
netmodel - network model (simple or maxmin)
scheduler_name - name of the scheduler
imode - information mode
min_sched_interval - minimal scheduling delay [s]
sched_time - duration of each scheduler invocation [s]
time - simulated makespan of the task graph execution [s]
execution_time - real duration of all scheduler invocations [s]
total_transfer - amount of data transferred amongst workers [MiB]
The file charts.zip contains charts obtained by processing the datasets.
On the X axis there is always bandwidth in [MiB/s].
There are the following files:
[DATASET]-schedulers-time - Absolute makespan produced by schedulers [seconds]
[DATASET]-schedulers-score - The same as above but normalized with respect to the best schedule (shortest makespan) for the given configuration.
[DATASET]-schedulers-transfer - Sums of transfers between all workers for a given configuration [MiB]
[DATASET]-[CLUSTER]-netmodel-time - Comparison of netmodels, absolute times [seconds]
[DATASET]-[CLUSTER]-netmodel-score - Comparison of netmodels, normalized to the average of model "simple"
[DATASET]-[CLUSTER]-netmodel-transfer - Comparison of netmodels, sum of transferred data between all workers [MiB]
[DATASET]-[CLUSTER]-schedtime-time - Comparison of MSD, absolute times [seconds]
[DATASET]-[CLUSTER]-schedtime-score - Comparison of MSD, normalized to the average of "MSD=0.0" case
[DATASET]-[CLUSTER]-imode-time - Comparison of Imodes, absolute times [seconds]
[DATASET]-[CLUSTER]-imode-score - Comparison of Imodes, normalized to the average of "exact" imode
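A sketch of how the "schedulers-score" normalization described above (makespan relative to the best schedule for a given configuration) could be recomputed from the dataframe; the set of columns treated as a configuration here is an assumption based on the column list above:
import pandas as pd
frame = pd.read_csv("elementary-result.zip")
# Best (shortest) makespan per configuration, ignoring the scheduler itself
config_cols = ["graph_id", "cluster_name", "bandwidth", "netmodel", "imode", "min_sched_interval", "sched_time"]
best = frame.groupby(config_cols)["time"].transform("min")
frame["score"] = frame["time"] / best   # 1.0 = best schedule for that configuration
print(frame.groupby("scheduler_name")["score"].mean().sort_values())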
Reproducing the results
$ git clone https://github.com/It4innovations/estee
$ cd estee
$ pip install .
You can either run benchmarks/generate.py to generate graphs from three categories (elementary, irw and pegasus):
$ cd benchmarks
$ python generate.py elementary.zip elementary
$ python generate.py irw.zip irw
$ python generate.py pegasus.zip pegasus
or use our task graph dataset that is provided at https://doi.org/10.5281/zenodo.2630384.
The benchmark is configured in benchmark.json. Then you can run the benchmark using this command:
$ python pbs.py compute benchmark.json
The benchmark script can be interrupted at any time (for example using Ctrl+C). When interrupted, it will store the computed results to the result file and resume the computation when launched again.
$ python view.py --all
The resulting plots will appear in a folder called outputs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
The Cancer Genome Atlas (TCGA) was a large-scale collaborative project initiated by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It aimed to comprehensively characterize the genomic and molecular landscape of various cancer types. This dataset contains information about GBM, an aggressive and highly malignant brain tumor that arises from glial cells, characterized by rapid growth and infiltrative behavior. The gene expression profile was measured experimentally using the Affymetrix HT Human Genome U133a microarray platform by the Broad Institute of MIT and Harvard University cancer genomic characterization center. The Sample IDs serve as unique identifiers for each sample.
Inspiration:
This dataset was uploaded to UBRITE for GTKB project.
Instruction:
The log2(x) normalization was removed, and z-normalization was performed on the dataset using a Python script.
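A minimal sketch of that preprocessing under one plausible reading (expression matrix with genes in rows and samples in columns; the per-gene z-scoring axis and the file name are assumptions, not the actual script):
import numpy as np
import pandas as pd
expr_log2 = pd.read_csv("gbm_expression.tsv", sep="\t", index_col=0)  # placeholder file: genes x samples, log2 values
expr_linear = np.power(2, expr_log2)              # remove the log2(x) normalization
z = expr_linear.sub(expr_linear.mean(axis=1), axis=0) \
               .div(expr_linear.std(axis=1), axis=0)   # z-normalization per gene (row-wise)
print(z.iloc[:5, :5])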
Acknowledgments:
Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8
The Cancer Genome Atlas Research Network., Weinstein, J., Collisson, E. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45, 1113–1120 (2013). https://doi.org/10.1038/ng.2764
U-BRITE last update: 07/13/2023
Community Data License Agreement Sharing 1.0 (CDLA-Sharing-1.0): https://cdla.io/sharing-1-0/
Imagine you are working as a data scientist at Zomato. Your goal is to enhance operational efficiency and improve customer satisfaction by analyzing food delivery data. You need to build an interactive Streamlit tool that enables seamless data entry for managing orders, customers, restaurants, and deliveries. The tool should support robust database operations like adding columns or creating new tables dynamically while maintaining compatibility with existing code.
Business Use Cases:
- Order Management: Identifying peak ordering times and locations. Tracking delayed and canceled deliveries.
- Customer Analytics: Analyzing customer preferences and order patterns. Identifying top customers based on order frequency and value.
- Delivery Optimization: Analyzing delivery times and delays to improve logistics. Tracking delivery personnel performance.
- Restaurant Insights: Evaluating the most popular restaurants and cuisines. Monitoring order values and frequency by restaurant.
Approach:
1) Dataset Creation: Use Python (Faker) to generate synthetic datasets for customers, orders, restaurants, and deliveries. Populate the SQL database with these datasets.
2) Database Design: Create normalized SQL tables for Customers, Orders, Restaurants, and Deliveries. Ensure compatibility for dynamic schema changes (e.g., adding columns, creating new tables).
3) Data Entry Tool: Develop a Streamlit app for adding, updating, and deleting records in the SQL database, and for dynamically creating new tables or modifying existing ones.
4) Data Insights: Use SQL queries and Python to extract insights like peak times, delayed deliveries, and customer trends. Visualize the insights in the Streamlit app (add-on).
5) OOP Implementation: Encapsulate database operations in Python classes. Implement robust and reusable methods for CRUD (Create, Read, Update, Delete) operations.
6) Order Management: Identifying peak ordering times and locations. Tracking delayed and canceled deliveries.
7) Customer Analytics: Analyzing customer preferences and order patterns. Identifying top customers based on order frequency and value.
8) Delivery Optimization: Analyzing delivery times and delays to improve logistics. Tracking delivery personnel performance.
9) Restaurant Insights: Evaluating the most popular restaurants and cuisines. Monitoring order values and frequency by restaurant.
Results: By the end of this project, learners will achieve:
- A fully functional SQL database for managing food delivery data.
- An interactive Streamlit app for data entry and analysis.
- 20 SQL queries written for analysis.
- Dynamic compatibility with database schema changes.
- Comprehensive insights into order trends, delivery performance, and customer behavior.
Project Evaluation Metrics:
- Database Design: Proper normalization of tables and relationships between them.
- Code Quality: Use of OOP principles to ensure modularity and scalability. Robust error handling for database operations.
- Streamlit App Functionality: Usability of the interface for data entry and insights. Compatibility with schema changes.
- Data Insights: Use of 20 SQL queries for data analysis.
- Documentation: Clear and comprehensive explanation of the code and approach.
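As a minimal sketch of the OOP/CRUD layer described in the approach, using SQLite as a stand-in database (table and column names are illustrative, not the project's actual schema):
import sqlite3
class OrderStore:
    """Encapsulates basic CRUD operations for an orders table."""
    def __init__(self, path="zomato.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS orders ("
            "order_id INTEGER PRIMARY KEY, customer_id TEXT, restaurant_id TEXT, "
            "order_value REAL, status TEXT)"
        )
    def add(self, customer_id, restaurant_id, order_value, status="placed"):
        cur = self.conn.execute(
            "INSERT INTO orders (customer_id, restaurant_id, order_value, status) VALUES (?, ?, ?, ?)",
            (customer_id, restaurant_id, order_value, status),
        )
        self.conn.commit()
        return cur.lastrowid
    def update_status(self, order_id, status):
        self.conn.execute("UPDATE orders SET status = ? WHERE order_id = ?", (status, order_id))
        self.conn.commit()
    def delete(self, order_id):
        self.conn.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))
        self.conn.commit()
store = OrderStore()
oid = store.add("C001", "R042", 499.0)
store.update_status(oid, "delivered")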
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
📄 Kaggle Dataset Description (FAERS Signals)
Title: FDA FAERS Adverse Drug Event Signals (Processed)
Subtitle: Drug–Adverse Event counts and disproportionality metrics (PRR, ROR) from the FDA’s Adverse Event Reporting System (FAERS).
🧾 Overview
The FDA Adverse Event Reporting System (FAERS) is a publicly available database of adverse drug event reports, medication error reports, and product quality complaints. This dataset provides processed, analysis-ready FAERS data, focusing on drug–adverse event pairs with quarterly counts and basic signal detection metrics.
📊 What’s Inside?
faers_drug_event_counts.csv
Clean, normalized table of drug–event pairs
Quarterly (QTR) counts of adverse events
faers_signals_prr_ror.csv
Proportional Reporting Ratio (PRR) and Reporting Odds Ratio (ROR) for each drug–event pair
Simple thresholds applied (min. count filter)
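For reference, PRR and ROR can be recomputed from the counts file using the standard 2x2 contingency definitions; a minimal sketch (the exact minimum-count thresholds used in faers_signals_prr_ror.csv are not reproduced here):
import pandas as pd
counts = pd.read_csv("faers_drug_event_counts.csv")
pair = counts.groupby(["DRUGNAME_NORM", "PT_NORM"])["n_reports"].sum().reset_index()
total = pair["n_reports"].sum()
drug_total = pair.groupby("DRUGNAME_NORM")["n_reports"].transform("sum")
event_total = pair.groupby("PT_NORM")["n_reports"].transform("sum")
a = pair["n_reports"]   # reports with the drug and the event
b = drug_total - a      # same drug, other events
c = event_total - a     # same event, other drugs
d = total - a - b - c   # neither
pair["PRR"] = (a / (a + b)) / (c / (c + d))
pair["ROR"] = (a * d) / (b * c)
print(pair.sort_values("PRR", ascending=False).head())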
🔍 Potential Use Cases
Pharmacovigilance signal detection
Drug safety surveillance
Predictive modeling of future label changes
Text/data mining for biomedical research
Event-driven investment research (biopharma risk signals)
⚠️ Limitations
FAERS is a spontaneous reporting system (subject to underreporting, duplication, and reporting bias).
Counts do not equal incidence rates.
Use this data for signal detection, not risk quantification.
This dataset is processed for Kaggle use and may not contain all FAERS fields.
📚 Source & License
Source: FDA FAERS Public Data
License: US Government Work (Public Domain)
🔥 This dataset bridges raw FDA data and ML-ready inputs, helping researchers, data scientists, and regulatory experts run faster signal detection workflows.
This dataset contains processed outputs from the FDA Adverse Event Reporting System (FAERS).
It provides cleaned quarterly counts of drug–event pairs along with disproportionality metrics such as PRR (Proportional Reporting Ratio) and ROR (Reporting Odds Ratio).
faers_drug_event_counts.csv
Raw counts of drug–event pairs per quarter.
faers_signals_prr_ror.csv
Signal detection metrics (PRR, ROR) with thresholds applied.
faers_drug_event_counts.csv
| Column | Description |
|---|---|
| DRUGNAME_NORM | Normalized drug name |
| QTR | Report quarter (YYYYQn) |
| PT_NORM | MedDRA Preferred Term (adverse event) |
| n_reports | Number of case reports |
| quarter_folder | Source folder of ASCII data |
faers_signals_prr_ror.csv
| Column | Description |
|---|---|
| DRUGNAME_NORM | Normalized drug name |
| PT_NORM | MedDRA Preferred Term |
| n_reports | Case counts |
| PRR | Proportional Reporting Ratio |
| ROR | Reporting Odds Ratio |
| PRR_signal | Boolean flag if PRR > threshold |
| ROR_signal | Boolean flag if ROR > threshold |
import pandas as pd
# Load drug-event counts
counts = pd.read_csv("/kaggle/input/faers-signals/faers_drug_event_counts.csv")
# Top 10 drugs by number of reports
print(counts.groupby("DRUGNAME_NORM")["n_reports"].sum().nlargest(10))
# Load signals
signals = pd.read_csv("/kaggle/input/faers-signals/faers_signals_prr_ror.csv")
# Find signals for Metformin
metformin_signals = signals[signals["DRUGNAME_NORM"] == "METFORMIN"]
print(metformin_signals.head())
📌 Citation
If you use this dataset, please cite:
FDA FAERS (2024–2025). Processed by anurmi.
Data source: U.S. Food & Drug Administration (public domain).
🔖 Tags
pharmacovigilance adverse-events drug-safety FDA healthcare pharmacology signal-detection medical-data public-health time-series
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created as part of the research project “Python Under the Microscope: A Comparative Energy Analysis of Execution Methods” (2025). The study explores the environmental sustainability of Python software by benchmarking five execution strategies—CPython, PyPy, Cython, ctypes, and py_compile—across 15 classical algorithmic workloads.
With energy and carbon efficiency becoming critical in modern computing, this dataset aims to:
Quantify execution time, CPU energy usage, and carbon emissions
Enable reproducible analysis of performance–sustainability trade-offs
Introduce and validate the GreenScore, a composite metric for sustainability-aware software evaluation
All benchmarks were executed on a controlled laptop environment (Intel Core i5-1235U, Linux 6.8). Energy was measured via Intel RAPL counters using the pyRAPL library. Carbon footprint was estimated using a conversion factor of 0.000475 gCO₂ per joule based on regional electricity intensity.
Each algorithm–method pair was run 50 times, capturing robust statistics for energy (μJ), time (s), and derived CO₂ emissions.
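Given the stated conversion factor, the derived CO₂ values can be recomputed from the raw energy readings; a minimal sketch (the CSV path and column name are assumptions about the file layout):
import pandas as pd
GCO2_PER_JOULE = 0.000475   # regional electricity intensity used in the study
df = pd.read_csv("cpython/energy/fibonacci.csv")   # placeholder path: one benchmark, 50 trials
energy_joules = df["energy_uJ"] * 1e-6             # assumed column name; energy is recorded in microjoules
df["co2_g"] = energy_joules * GCO2_PER_JOULE
print(df["co2_g"].describe())                      # summary statistics over the 50 trials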
Per-method folders (cpython/, pypy/, etc.) contain raw energy/ and time/ CSV files for all 15 benchmarks (50 trials each), as well as mean summaries.
Aggregate folder includes combined metric comparisons, normalized data, and carbon footprint estimations.
Analysis folder contains derived datasets: normalized scores, standard deviation, and the final GreenScore rankings used in our paper.
This dataset is ideal for:
Reproducible software sustainability studies
Benchmarking Python execution strategies
Analyzing energy–performance–carbon trade-offs
Validating green metrics and measurement tools
Researchers and practitioners are encouraged to use, extend, and cite this dataset in sustainability-aware software design.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains geospatial and remote sensing data for the La Dorada area, Colombia, prepared for deep learning tasks (e.g., water potential mapping using Conv1D/MLP models). The dataset is stored in NPZ format for easy loading with NumPy and TensorFlow.
The dataset consists of four main components:
- Continuous variables (x_var): shape [3584, 1097, 10] (rows, columns, channels). Normalization: all values are normalized to the range [0, 1].
- Image data (x_img): shape [3584, 1097, 3, 1] (rows, columns, channels, extra dimension for Conv1D compatibility), values in [0, 1].
- Categorical data (x_cat): shape [3584, 1097, 1] (rows, columns, channels), dtype int32.
- Target (y): shape [3584, 1097, 1] (rows, columns, channels), values in [0, 1].
The NPZ file contains all arrays in their original shapes:
import numpy as np
data = np.load("dataset_ladorada.npz")
x_var = data["x_var"] # shape: (3584, 1097, 10)
x_img = data["x_img"] # shape: (3584, 1097, 3, 1)
x_cat = data["x_cat"] # shape: (3584, 1097, 1)
y = data["y"] # shape: (3584, 1097, 1)
Notes: dtypes are float32 for continuous variables and the target, and int32 for categorical data. All arrays share the spatial dimensions [3584, 1097] for consistency. The extra trailing dimension of x_img ([..., 1]) allows direct usage in Conv1D layers without reshaping.
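Building on the loading snippet above, a common next step is to flatten the spatial grid into per-pixel samples for an MLP or Conv1D model; a minimal sketch using only the documented array keys and shapes:
import numpy as np
data = np.load("dataset_ladorada.npz")
x_var, y = data["x_var"], data["y"]
# Flatten the 3584 x 1097 grid into per-pixel samples
X = x_var.reshape(-1, x_var.shape[-1])   # shape: (3584*1097, 10)
labels = y.reshape(-1)                   # shape: (3584*1097,)
print(X.shape, labels.shape)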
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.
It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.
Key attributes include StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions, Resources, Internet, EduTech, Motivation, StressLevel, Gender, Age (18–30 years), LearningStyle, ExamScore, and FinalGrade.
The dataset can be used for predictive modeling of performance indicators (ExamScore, FinalGrade). It was analyzed in Python, including work on LearningStyle categories and extracting insights for adaptive learning.
File: merged_dataset.csv → 14,003 rows × 16 columns. Includes student demographics, behaviors, engagement, learning styles, and performance indicators.
This dataset is an excellent playground for educational data mining, from clustering and behavioral analytics to predictive modeling and personalized learning applications.
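A minimal sketch of one such analysis, grouping exam performance by learning style (column names are taken from the attribute list above and may differ slightly in the file):
import pandas as pd
df = pd.read_csv("merged_dataset.csv")   # 14,003 rows x 16 columns
# Average exam score per learning style
print(df.groupby("LearningStyle")["ExamScore"].mean())
# How study behavior relates to performance
print(df[["StudyHours", "Attendance", "ExamScore"]].corr())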