11 datasets found
  1. Chicago Data Portal

    • kaggle.com
    zip
    Updated Dec 8, 2020
    Cite
    David (2020). Chicago Data Portal [Dataset]. https://www.kaggle.com/zhaodianwen/chicago-data-portal
    Explore at:
    zip (125083 bytes)
    Dataset updated
    Dec 8, 2020
    Authors
    David
    Description

    Assignment Topic: In this assignment, you will download the datasets provided, load them into a database, write and execute SQL queries to answer the problems provided, and upload a screenshot showing the correct SQL query and result for review by your peers. A Jupyter notebook is provided in the preceding lesson to help you with the process.

    This assignment involves 3 datasets for the city of Chicago obtained from the Chicago Data Portal:

    1. Chicago Socioeconomic Indicators

    This dataset contains a selection of six socioeconomic indicators of public health significance and a hardship index, by Chicago community area, for the years 2008 – 2012.

    2. Chicago Public Schools

    This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year.

    3. Chicago Crime Data

    This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.

    Instructions:

    1. Review the datasets

    Before you begin, you will need to become familiar with the datasets. Snapshots for the three datasets in .CSV format can be downloaded from the following links:

    Chicago Socioeconomic Indicators: Click here

    Chicago Public Schools: Click here

    Chicago Crime Data: Click here

    NOTE: Ensure you have downloaded the datasets using the links above instead of directly from the Chicago Data Portal. The versions linked here are subsets of the original datasets, and some of the column names have been modified to be more database friendly, which will make it easier to complete this assignment. The CSV file provided above for the Chicago Crime Data is a very small subset of the full dataset available from the Chicago Data Portal. The original dataset is over 1.55 GB in size and contains over 6.5 million rows. For the purposes of this assignment you will use a much smaller sample with only about 500 rows.

    2. Load the datasets into a database

    Perform this step using the LOAD tool in the Db2 console. You will need to create 3 tables in the database, one for each dataset, named as follows, and then load the respective .CSV file into the table:

    CENSUS_DATA

    CHICAGO_PUBLIC_SCHOOLS

    CHICAGO_CRIME_DATA
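
    The assignment itself uses the Db2 LOAD tool, but purely as an illustration (not part of the official instructions), the same three tables can be created locally with pandas and SQLite. The CSV file names below are hypothetical placeholders for the snapshots linked above.

      import sqlite3
      import pandas as pd

      conn = sqlite3.connect("chicago.db")

      tables = {
          "CENSUS_DATA": "ChicagoCensusData.csv",                # hypothetical file name
          "CHICAGO_PUBLIC_SCHOOLS": "ChicagoPublicSchools.csv",  # hypothetical file name
          "CHICAGO_CRIME_DATA": "ChicagoCrimeData.csv",          # hypothetical file name
      }

      # Load each CSV snapshot into its table, replacing the table if it already exists.
      for table_name, csv_path in tables.items():
          pd.read_csv(csv_path).to_sql(table_name, conn, if_exists="replace", index=False)

      # Example query: total number of crime records loaded.
      print(pd.read_sql("SELECT COUNT(*) AS total_crimes FROM CHICAGO_CRIME_DATA", conn))
      conn.close()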

  2. OMOP primary database assessment of risk.

    • plos.figshare.com
    xls
    Updated Apr 18, 2024
    + more versions
    Cite
    Roger Ward; Christine Mary Hallinan; David Ormiston-Smith; Christine Chidgey; Dougie Boyle (2024). OMOP primary database assessment of risk. [Dataset]. http://doi.org/10.1371/journal.pone.0301557.t002
    Explore at:
    xls
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Roger Ward; Christine Mary Hallinan; David Ormiston-Smith; Christine Chidgey; Dougie Boyle
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The use of routinely collected health data for secondary research purposes is increasingly recognised as a methodology that advances medical research, improves patient outcomes, and guides policy. This secondary data, as found in electronic medical records (EMRs), can be optimised through conversion into a uniform data structure to enable analysis alongside other comparable health metric datasets. This can be achieved with the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM), which employs a standardised vocabulary to facilitate systematic analysis across various observational databases. The concept behind the OMOP-CDM is the conversion of data into a common format through the harmonisation of terminologies, vocabularies, and coding schemes within a unique repository. The OMOP model enhances research capacity through the development of shared analytic and prediction techniques; pharmacovigilance for the active surveillance of drug safety; and ‘validation’ analyses across multiple institutions across Australia, the United States, Europe, and the Asia Pacific. In this research, we aim to investigate the use of the open-source OMOP-CDM in the PATRON primary care data repository.

    Methods: We used standard structured query language (SQL) to construct, extract, transform, and load scripts to convert the data to the OMOP-CDM. The process of mapping distinct free-text terms extracted from various EMRs presented a substantial challenge, as many terms could not be automatically matched to standard vocabularies through direct text comparison. This resulted in a number of terms that required manual assignment. To address this issue, we implemented a strategy where our clinical mappers were instructed to focus only on terms that appeared with sufficient frequency. We established a specific threshold value for each domain, ensuring that more than 95% of all records were linked to an approved vocabulary like SNOMED once appropriate mapping was completed. To assess the data quality of the resultant OMOP dataset we utilised the OHDSI Data Quality Dashboard (DQD) to evaluate the plausibility, conformity, and comprehensiveness of the data in the PATRON repository according to the Kahn framework.

    Results: Across three primary care EMR systems we converted data on 2.03 million active patients to version 5.4 of the OMOP common data model. The DQD assessment involved a total of 3,570 individual evaluations. Each evaluation compared the outcome against a predefined threshold. A ’FAIL’ occurred when the percentage of non-compliant rows exceeded the specified threshold value. In this assessment of the primary care OMOP database described here, we achieved an overall pass rate of 97%.

    Conclusion: The OMOP CDM’s widespread international use, support, and training provides a well-established pathway for data standardisation in collaborative research. Its compatibility allows the sharing of analysis packages across local and international research groups, which facilitates rapid and reproducible data comparisons. A suite of open-source tools, including the OHDSI Data Quality Dashboard (Version 1.4.1), supports the model. Its simplicity and standards-based approach facilitates adoption and integration into existing data processes.
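
    To make the pass/fail rule concrete, here is a minimal sketch (an illustration only, not the OHDSI Data Quality Dashboard implementation): a check fails when the percentage of non-compliant rows exceeds its predefined threshold, and the overall pass rate is the share of passing checks.

      def evaluate_check(violating_rows: int, total_rows: int, threshold_pct: float) -> str:
          # A check FAILs when the percentage of non-compliant rows exceeds the threshold.
          pct_violated = 100.0 * violating_rows / total_rows if total_rows else 0.0
          return "FAIL" if pct_violated > threshold_pct else "PASS"

      # Toy example with two checks, each with a 5% threshold.
      results = [evaluate_check(12, 1000, 5.0), evaluate_check(80, 1000, 5.0)]
      pass_rate = 100.0 * results.count("PASS") / len(results)
      print(results, f"overall pass rate: {pass_rate:.0f}%")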

  3. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset in which the transaction data covers all transactions that happened over a period of time. The retailer will use the results to grow the business and to suggest itemsets to customers, so that we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most useful when you are planning to discover associations between different objects in a set and to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, which allows the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule usually needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Below is a short description of each.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    [Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png] [Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we will clean our data frame by removing missing values.

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
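
    The workflow above uses R and the arules package; as an illustrative alternative (not the author's code), the same pipeline can be sketched in Python with pandas and mlxtend, using the column names from the dataset description above.

      # Illustrative Python sketch of the described workflow (the original uses R/arules).
      import pandas as pd
      from mlxtend.frequent_patterns import apriori, association_rules

      # Read the Excel file and drop rows with missing item names or customer IDs.
      df = pd.read_excel("Assignment-1_Data.xlsx")
      df = df.dropna(subset=["Itemname", "CustomerID"])

      # One row per invoice, one column per item, True if the item appears on that invoice.
      basket = df.groupby(["BillNo", "Itemname"])["Quantity"].sum().unstack(fill_value=0) > 0

      # Mine frequent itemsets and derive association rules.
      frequent_itemsets = apriori(basket, min_support=0.02, use_colnames=True)
      rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
      print(rules.sort_values("lift", ascending=False).head())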

  4. ECE657AW20-ASG4-Coronavirus

    • kaggle.com
    zip
    Updated Nov 1, 2025
    Cite
    MarkCrowley (2025). ECE657AW20-ASG4-Coronavirus [Dataset]. https://www.kaggle.com/markcrowley/ece657aw20asg4coronavirus
    Explore at:
    zip (1659403 bytes)
    Dataset updated
    Nov 1, 2025
    Authors
    MarkCrowley
    Description

    COVID-19 Data for Analysis and Machine Learning

    There are lots of datasets online, with more growing every day, to help us all get a handle on this pandemic. Here are just a few links to data we've found that students in ECE 657A, and anyone else who finds their way here, can play with to practice their machine learning skills. The main dataset is the COVID-19 dataset from Johns Hopkins University. This data is perfect for time series analysis and Recurrent Neural Networks, the final topic in the course. This dataset will be left public so anyone can see it, but to join you must request the link from Prof. Crowley or be in the ECE 657A W20 course at the University of Waterloo.

    For ECE 657A W20 Students

    Your bonus grade for assignment 4 comes from creating a kernel from this dataset, writing up some useful analysis, and publishing that notebook. You can do any kind of analysis you like, but some good places to start are:

    • Analysis: feature extraction and analysis of the data to look for patterns that aren't evident from the original features (this is hard for the simple spread/infection/death data since there aren't that many features).
    • Other Data: utilize any other datasets in your kernels by loading data about the countries themselves (population, density, wealth, etc.) or their responses to the situation. Tip: If you open a New Notebook related to this dataset you can easily add new data available on Kaggle and link that to your analysis.
    • HOW'S MY FLATTENING COVID19 DATASET: this dataset has a lot more files and includes a lot of what I was talking about, so if you produce good kernels there you can also count them for your asg4 grade. https://www.kaggle.com/howsmyflattening/covid19-challenges
    • Predict: make predictions about confirmed cases, deaths, recoveries or other metrics for the future. You can test your models by training on the past and predicting on the following days, then post a prediction for tomorrow or the next few days given ALL the data up to this point. Hopefully the datasets we've linked here will be updated automatically so your kernels would update as well. (A minimal prediction sketch follows below.)
    • Create Tasks: you can make your own "Tasks" as part of this Kaggle and propose your own solution to them. Then others can try solving them as well.
    • Groups: students can do this assignment either in the same groups they had for assignment 3 or individually.
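
    As a starting point for the "Predict" idea above, here is a minimal, hedged sketch of training on the past and predicting the next few days. It assumes a hypothetical pre-processed CSV with one row per day and columns "date" and "confirmed"; the actual Johns Hopkins files are laid out differently and would need reshaping first.

      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression

      # Hypothetical tidy file: one row per day with columns "date" and "confirmed".
      ts = pd.read_csv("confirmed_cases.csv", parse_dates=["date"]).set_index("date")["confirmed"]

      train, test = ts[:-7], ts[-7:]                      # hold out the most recent week
      X_train = np.arange(len(train)).reshape(-1, 1)      # day index as the only feature
      model = LinearRegression().fit(X_train, train.values)

      X_test = np.arange(len(train), len(ts)).reshape(-1, 1)
      pred = model.predict(X_test)
      print(pd.DataFrame({"actual": test.values, "predicted": pred.round()}, index=test.index))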

    Suggest other datasets

    We're happy to add other relevant data to this Kaggle; in particular, it would be great to integrate live data on the following:

    • Progression of each country/region/city in "days since X level", such as days since 100 confirmed cases; see the link for a great example of such a dataset being plotted. I haven't seen a live link to a CSV of that data, but we could generate one.
    • Mitigation policies enacted by local governments in each city/region/country. These are the dates when that region first enacted Level 1, 2, 3, 4 containment, started encouraging social distancing, or closed different levels of schools, pubs, restaurants, etc.
    • The hidden positives: this would be a dataset, or a method for estimating one, as described by Emtiyaz Khan in this twitter thread. The idea is: how many unreported or unconfirmed cases are there in any region, and can we build an estimate of that number using other regions with widespread testing as a baseline and the death rates, which act like observations of a process with a hidden variable, the true infection rate?
    • Paper discussing one way to compute this: https://cmmid.github.io/topics/covid19/severity/global_cfr_estimates.html

  5. Experiments with Frequency Fitness Assignment based Algorithms on the...

    • data.niaid.nih.gov
    Updated Dec 1, 2023
    Cite
    Tianyu LIANG (梁天宇) (2023). Experiments with Frequency Fitness Assignment based Algorithms on the Traveling Salesperson Problem [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7851753
    Explore at:
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Institute of Applied Optimization, School of Artificial Intelligence and Big Data, Hefei University
    Authors
    Tianyu LIANG (梁天宇)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Introduction

    In this archive, we provide the implementation and experimental results of eight different algorithms for solving Traveling Salesperson Problem (TSP) instances from TSPLIB. A TSP is defined by a fully-connected weighted graph of n cities. The goal is to find the overall shortest tour that visits each city exactly once and returns to its starting point. The TSP is NP-hard. We consider 56 symmetric instances from the well-known TSPLIB. Solutions in our work are stored in the path representation, where a tour is encoded as a permutation x of the numbers 1 to n, each identifying a city. If a city appears at index j in the permutation x, then it will be the jth city to be visited. This means that a tour x will pass the following edges: (x[1], x[2]), (x[2], x[3]), (x[3], x[4]), … (x[n-1], x[n]), (x[n], x[1]).
    2. Directory Structure

    This dataset is split into multiple separate tar.xz archives. These can be unpacked in the same folder and will produce the directory structure described below. Each archive contains this note and the license information, but apart from that, there is no redundancy. This archive contains the following directories:

    source contains the Python source codes needed to run the experiment.

    moptipy-main is a local copy of the moptipy package used for our experiment.

    tsplib contains the TSPLIB data. This includes the instances used in our experiments as files in text format with suffix .tsp. If an optimal tour is given by TSPLIB, it is stored in a text format file with suffix .opt.tour and a name prefix identical to the instance file. In other words, the file eil51.tsp contains the TSP instance eil51 and the file eil51.opt.tour contains the corresponding optimal tour. Both the TSP instances and optimal tours can be downloaded from http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/tsp/. We also include the documentation of TSPLIB in file tsp95.pdf, the TSPLIB FAQ both as HTML and PDF file (tsplib_faq.html and tsplib_faq.pdf), and the list of known optimal tour lengths as HTML and PDF file (optimal_tour_lengths_of_symmetric_tsps.html, optimal_tour_lengths_of_symmetric_tsps.pdf). Notice that, while the TSP instances we used are Euclidean, all distances are converted to integers as prescribed by the documentation.

    results is the directory with the log files. Each log file contains the information of one run, i.e., one execution of one algorithm on one problem instance. All improving moves of a run as well as the final solution are stored in the log file. The direct sub-folders of results represent the algorithms and contain one folder per TSP instance, which, in turn, contain the log files.

    evaluation is a folder with the extracted evaluation and figures.

    evaluation_edited is a folder with evaluation figures slightly edited for better visual appeal (but obviously without changing any result / scientific content).

    evaluator is a folder with a Python script main.py that generates all the files in evaluation from the data it finds in results. It requires the moptipy package to be installed in the version given in requirements.txt.

    3. Algorithms

    The (1+1) EA is the most basic evolutionary algorithm and can also be considered a randomized local search. It starts with one random solution/permutation xc and computes its length yc=f(xc). In each iteration, it applies a unary search operator op to obtain a new tour xn=op(xc) and computes its length yn=f(xn). If yn<=yc, then it accepts the new tour and sets xc=xn and yc=yn. The results of this algorithm are given in folder results/ea_revn.

    FFA is a fitness assignment process that takes place before this last step in the EA. We integrate FFA into the (1+1) EA and obtain the (1+1) FEA. This algorithm uses an additional table H which counts, for any tour length y, how often it has been seen during the search so far. After the new tour xn is created and its objective value yn is computed, the (1+1) FEA sets H[yc] = H[yc] + 1 and H[yn] = H[yn] + 1. It accepts xn if and only if H[yn] <= H[yc] and, only in this case, sets xc=xn and yc=yn. The results of this algorithm are given in folder results/fea_revn.

    SA is the classical simulated annealing algorithm. In our experiment, it accepts the new solution xn with probability P. If the new solution is better, the acceptance probability P is 1. For worse solutions, the probability is between 0 and 1, i.e., sometimes worse solutions are also accepted. This algorithm has a temperature cooling schedule: it starts at an initial temperature, and over time the temperature decreases. The probability P of accepting a worse solution depends on the temperature and decreases as well. The results of this algorithm are given in folder results/sa_revn.

    An FFA-based version of SA uses the frequency fitness instead of the objective values in all acceptance decisions. The results of this algorithm are given in folder results/fsa_revn.

    EAFEA(A) is a hybrid which alternates between the EA and the FEA and copies a solution from the FEA to the EA if it has an entirely new objective value, i.e., if H[yn] = 1. The results of this algorithm are given in folder results/eafea2_revn.

    SAFEA(A) is a hybrid which alternates between the SA and the FEA and copies a solution from the FEA to the SA if it has an entirely new objective value, i.e., if H[yn] = 1. The results of this algorithm are given in folder results/safea2_revn.

    EAFEA(B) is a hybrid which alternates between the EA and the FEA and copies a solution from the FEA to the EA part if it has a better objective value. The results of this algorithm are given in folder results/eafea_revn.

    SAFEA(B) is a hybrid which alternates between the SA and the FEA and copies a solution from the FEA to the SA part if it has a better objective value. The results of this algorithm are given in folder results/safea_revn.

    We apply all algorithms with the same unary operator reverse, which reverses a randomly chosen subsequence of the tour. This operator is also often called a "2-opt move". It has the advantage that the new objective value of a new solution can be computed in O(1) if the objective value of the solution from which it is derived is known. A short Python sketch of the (1+1) FEA acceptance rule is given below.
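
    As a simplified illustration of the acceptance rules described above (a sketch only, not the moptipy implementation used for the experiments), the (1+1) FEA loop can be written as:

      from collections import defaultdict

      def one_plus_one_fea(f, random_solution, op, max_iterations):
          """f: tour length, random_solution: creates a random permutation, op: the unary 'reverse' move."""
          H = defaultdict(int)          # frequency table: how often each tour length has been seen
          xc = random_solution()
          yc = f(xc)
          for _ in range(max_iterations):
              xn = op(xc)               # e.g. reverse a randomly chosen subsequence (2-opt move)
              yn = f(xn)
              H[yc] += 1
              H[yn] += 1
              if H[yn] <= H[yc]:        # FFA acceptance: compare frequencies, not objective values
                  xc, yc = xn, yn
          return xc, yc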

    4. How to Run the Experiment

    First, you need to make sure that all the dependencies this program requires are installed. You can do this by executing the following command in the terminal:

    pip install matplotlib numba numpy psutil scikit-learn moptipy moptipyapps

    Now enter the source directory, i.e., the directory containing the run.py file, in your terminal. Depending on your system configuration and whether you run Windows or Linux, you can start the program with one of the commands below. (If running the first command returns with an error, just try the next one in the list.)

    python3 -m run
    python -m run
    python run.py
    python3 run.py

    Then the experiment will run. It will automatically create a sub-folder results in source and place all log files that are generated into it. Be careful: The experiment will take a long time. However, if you have multiple CPUs, you can simply start several instances of this program in independent terminals. Each instance will then conduct different runs. This also works if this folder is shared over the network, in which case you can run multiple processes on multiple PCs. Side note: This experiment uses the moptipy package for implementing its algorithms, running the experiments, and gathering their results. If you want to install moptipy on your system instead of using the version supplied here, you can install it via pip install moptipy. It also uses moptipyapps to load some data.

    5. Literature

    Frequency Fitness Assignment (FFA):

    • Thomas Weise, Zhize Wu, Xinlu Li, Yan Chen, and Jörg Lässig. Frequency Fitness Assignment: Optimization without Bias for Good Solutions can be Efficient. IEEE Transactions on Evolutionary Computation (TEVC). 2022. Early Access. doi:10.1109/TEVC.2022.3191698.
    • Thomas Weise, Zhize Wu, Xinlu Li, and Yan Chen. Frequency Fitness Assignment: Making Optimization Algorithms Invariant under Bijective Transformations of the Objective Function Value. IEEE Transactions on Evolutionary Computation 25(2):307–319. April 2021. Preprint available at arXiv:2001.01416v5 [cs.NE] 15 Oct 2020. doi:10.1109/TEVC.2020.3032090. Experimental results and source code are available at doi:10.5281/zenodo.3899474.
    • Tianyu Liang, Zhize Wu, Jörg Lässig, Daan van den Berg, and Thomas Weise. Solving the Traveling Salesperson Problem using Frequency Fitness Assignment. In Hisao Ishibuchi, Chee-Keong Kwoh, Ah-Hwee Tan, Dipti Srinivasan, Chunyan Miao, Anupam Trivedi, and Keeley A. Crockett, editors, Proceedings of the IEEE Symposium on Foundations of Computational Intelligence (IEEE FOCI'22), part of the IEEE Symposium Series on Computational Intelligence (SSCI 2022), December 4–7, 2022, Singapore, pages 360–367. IEEE. doi:10.1109/SSCI51031.2022.10022296.
    • Thomas Weise, Mingxu Wan, Ke Tang, Pu Wang, Alexandre Devert, and Xin Yao. Frequency Fitness Assignment. IEEE Transactions on Evolutionary Computation (IEEE-EC) 18(2):226-243, April 2014. doi:10.1109/TEVC.2013.2251885.
    • Thomas Weise, Xinlu Li, Yan Chen, and Zhize Wu. Solving Job Shop Scheduling Problems Without Using a Bias for Good Solutions. In Genetic and Evolutionary Computation Conference Companion (GECCO'21 Companion), July 10-14, 2021, Lille, France. ACM, New York, NY, USA. ISBN 978-1-4503-8351-6. doi:10.1145/3449726.3463124.
    • Thomas Weise, Yan Chen, Xinlu Li, and Zhize Wu. Selecting a diverse set of benchmark instances from a tunable model problem for black-box discrete optimization algorithms. Applied Soft Computing Journal (ASOC), 92:106269, June 2020. doi:10.1016/j.asoc.2020.106269.
    • Thomas Weise, Mingxu Wan, Ke Tang, and Xin Yao. Evolving Exact Integer Algorithms with Genetic Programming. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC'14), Proceedings of the 2014 World

  6. Politeknik A Students’ Academic Records

    • kaggle.com
    zip
    Updated Oct 16, 2025
    Cite
    Purnama Ridzky Nugraha (2025). Politeknik A Students’ Academic Records [Dataset]. https://www.kaggle.com/datasets/purnamaridzkynugraha/politeknik-a-students-academic-records
    Explore at:
    zip (662047 bytes)
    Dataset updated
    Oct 16, 2025
    Authors
    Purnama Ridzky Nugraha
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a synthetic dataset representing the academic performance of students from Department A at Politeknik B. It contains semester-wise data across 8 semesters. The dataset is designed to simulate real-life trends and challenges in student performance:

    • Generational trend: The GPA per semester tends to decrease for more recent cohorts (younger batches).
    • Semester 8 exception: In the final semester, students focus on their thesis/project instead of regular courses. As a result, GPA values do not follow the usual trend, since students prioritize graduation rather than maximizing GPA.
    • Analytical challenge: The dataset encourages exploration of why GPA trends increase or decrease across semesters and batches. Students can analyze which factors (theory/practical load, assignments, absences) most affect performance.

    Feature Description

    • theory_courses (int): Number of theory courses taken in the semester.
    • theory_credits (int): Total credit units for theory courses.
    • theory_total_hours (int): Weekly hours spent on theory courses.
    • theory_assignments (int): Number of individual assignments in theory courses.
    • theory_group_assignments (int): Number of group assignments in theory courses.
    • practical_courses (int): Number of practical/lab courses taken in the semester.
    • practical_credits (int): Total credit units for practical courses.
    • practical_total_hours (int): Weekly hours spent on practical courses.
    • practical_assignments (int): Number of individual assignments in practical courses.
    • practical_group_assignments (int): Number of group assignments in practical courses.
    • sick (int): Number of absences due to sickness.
    • permission (int): Number of excused absences.
    • absence (int): Number of unexcused absences.
    • assignment_delay (int): Total delayed assignments in the semester.
    • student_id (string): Unique identifier for each student.
    • semester (int): Semester number (1–8).
    • year (int): Student entry year / batch. Older batches = higher GPA trend, newer batches = lower trend.
    • gpa (float): Semester GPA (range ~2.5–4.0).

    Use Case / Analytical Challenge: This dataset is perfect for:

    1. Exploring trends of GPA across generations (why newer batches may have declining GPA).
    2. Understanding the impact of theory/practical workload, assignments, and attendance on student performance.
    3. Investigating why final semester GPA (semester 8) might break trends due to thesis/project focus.
    4. Predictive modeling: estimate GPA based on features per semester.
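
    For use case 4, a minimal predictive-modeling sketch could look like the following; the CSV file name is a hypothetical placeholder, while the feature names come from the table above.

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_absolute_error
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("politeknik_a_records.csv")  # hypothetical file name
      features = [
          "theory_courses", "theory_credits", "theory_total_hours",
          "theory_assignments", "theory_group_assignments",
          "practical_courses", "practical_credits", "practical_total_hours",
          "practical_assignments", "practical_group_assignments",
          "sick", "permission", "absence", "assignment_delay", "semester", "year",
      ]
      X_train, X_test, y_train, y_test = train_test_split(
          df[features], df["gpa"], test_size=0.2, random_state=42
      )
      model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
      print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
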
  7. Dock Door Self-Assignment Kiosks Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    + more versions
    Cite
    Growth Market Reports (2025). Dock Door Self-Assignment Kiosks Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/dock-door-self-assignment-kiosks-market
    Explore at:
    pdf, pptx, csv
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Dock Door Self-Assignment Kiosks Market Outlook



    According to our latest research, the global Dock Door Self-Assignment Kiosks market size reached USD 1.32 billion in 2024, with a robust compound annual growth rate (CAGR) of 12.8% projected through the forecast period. By 2033, the market is anticipated to reach USD 3.60 billion, driven by the increasing adoption of automation technologies across logistics, warehousing, and distribution sectors. The accelerated demand for operational efficiency, real-time data processing, and reduction in manual labor are pivotal factors fueling this significant market expansion.



    The primary growth factor for the Dock Door Self-Assignment Kiosks market is the escalating need for streamlined logistics and warehouse management. As global supply chains grow in complexity and scale, organizations are seeking advanced solutions to minimize bottlenecks and enhance throughput. Dock door self-assignment kiosks allow for automated scheduling and assignment of loading docks, reducing idle times and optimizing resource utilization. This automation not only accelerates turnaround but also minimizes human error, which can be costly in high-volume environments. The integration of such kiosks with warehouse management systems (WMS) and transportation management systems (TMS) further boosts their value proposition, enabling seamless communication and data exchange across the supply chain ecosystem.



    Another vital growth driver is the rapid expansion of e-commerce and omnichannel retailing, which has placed unprecedented pressure on distribution centers and logistics hubs to process higher volumes of goods swiftly and accurately. Dock door self-assignment kiosks play a crucial role in meeting these demands by automating the check-in, assignment, and scheduling processes for inbound and outbound shipments. The ability to handle increased shipment volumes without proportional increases in labor costs positions these kiosks as a strategic investment for businesses aiming to scale operations efficiently. Additionally, the COVID-19 pandemic has underscored the importance of contactless solutions, further accelerating the adoption of self-service kiosks in logistics and warehousing environments.



    Technological advancements in kiosk hardware and software are also propelling market growth. Modern kiosks are equipped with advanced touchscreens, robust connectivity options, and sophisticated software capable of integrating with enterprise resource planning (ERP) systems. The increasing availability of cloud-based solutions has made it easier for businesses of all sizes to deploy and manage these kiosks across multiple locations. Furthermore, enhancements in user interface design and mobile integration are making these systems more accessible and user-friendly, reducing training requirements and increasing adoption rates. The growing emphasis on data analytics and real-time monitoring is also driving demand, as businesses seek to leverage actionable insights for continuous improvement in dock operations.



    From a regional perspective, North America currently dominates the Dock Door Self-Assignment Kiosks market, accounting for the largest share due to the presence of advanced logistics infrastructure, high penetration of automation technologies, and significant investments in supply chain modernization. Europe follows closely, driven by stringent regulations regarding workplace safety and efficiency, as well as the increasing adoption of smart logistics solutions. The Asia Pacific region is expected to exhibit the highest CAGR over the forecast period, fueled by rapid industrialization, the expansion of e-commerce, and government initiatives to modernize logistics infrastructure. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as businesses in these regions increasingly recognize the benefits of automation in logistics and warehousing operations.





    Product Type Analysis



    The Dock Door Self-Assignment Kiosks market is segmented by product type into

  8. Data from: ML-Optimized QKD Frequency Assignment for Efficient...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 23, 2024
    Cite
    Mehdizadeh, Pouya; Dibaj, Mohammad reza; Beyranvand, Hamzeh; Arpanaei, Farhad (2024). ML-Optimized QKD Frequency Assignment for Efficient Quantum-Classical Coexistence in Multi-Band EONs [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13364897
    Explore at:
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    Amirkabir University of Technology (Tehran Polytechnic)
    Universidad Carlos III de Madrid
    Authors
    Mehdizadeh, Pouya; Dibaj, Mohammad reza; Beyranvand, Hamzeh; Arpanaei, Farhad
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: Quantum key distribution (QKD) represents a cutting-edge technology that ensures unbreakable security. Coexisting quantum and classical signals on a multi-band (O+E+S+C+L-band) system offer a viable solution for secure, high-rate networks amidst growing classical traffic and address quantum signal sensitivity. In this study, we assume a dynamic classical traffic load and varying configurations of classical channels (CChs). Considering the varying behavior of the Secure Key Rate (SKR) under different classical conditions, solving the integral noise equations is crucial for optimizing QKD implementation and enhancing resource efficiency. The complexity and time-consuming nature of this process challenge infrastructure providers in determining the optimal quantum channel (QCh) frequency in real time. To tackle these challenges, we propose a machine learning (ML) algorithm. By leveraging ML, QKD can be implemented efficiently, optimizing resource utilization while significantly reducing computation and processing time under dynamic classical traffic. We implement three ML algorithms at various fiber intervals, all of which estimate the optimal frequency for the QCh with 99% accuracy and perform computations on average in 0.09 seconds, which is significantly faster compared to integral computational methods that have a mean time of 637 seconds.

    Information: In this file, the Excel sheet contains data for each fiber interval, including inputs such as the fiber length in each interval, the overall classical loading factor percentage, the C-band loading factor percentage, the L-band loading factor percentage, and the highest active classical frequency (which serve as inputs to the machine learning model), as well as the QCh frequency that resulted in the highest SKR.
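
    As an illustration of the described setup (not the authors' code), a regressor can be fitted to map the listed inputs to the optimal quantum-channel frequency; the file and column names below are hypothetical placeholders for the fields in the Excel sheet.

      import pandas as pd
      from sklearn.ensemble import GradientBoostingRegressor
      from sklearn.model_selection import cross_val_score

      df = pd.read_excel("qkd_frequency_assignment.xlsx")  # hypothetical file name
      X = df[["fiber_length_km", "overall_load_pct", "c_band_load_pct",
              "l_band_load_pct", "highest_active_classical_freq"]]   # hypothetical column names
      y = df["optimal_qch_frequency"]                                  # hypothetical column name

      model = GradientBoostingRegressor()
      scores = cross_val_score(model, X, y, cv=5, scoring="r2")
      print("cross-validated R^2:", scores.mean())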

  9. Train Data

    • data.nsw.gov.au
    data, xls
    Updated Apr 20, 2021
    Cite
    Transport for NSW (2021). Train Data [Dataset]. https://data.nsw.gov.au/data/dataset/train-data
    Explore at:
    xls (39568), data
    Dataset updated
    Apr 20, 2021
    Dataset authored and provided by
    Transport for NSW (http://www.transport.nsw.gov.au/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Train Patronage dataset provides Opal train trips by month, operator, contract area and card type. An Opal trip is where an Opal card is used to tap on and tap off, including where a single tap-on or tap-off is recorded. All other travel is not included.

    The Peak Train Load Survey 2016 provides estimates of train loads during the AM and PM peak periods, derived from a survey held on Tuesdays, Wednesdays and Thursdays from 1 March 2016 to 17 March 2016. Since the introduction of Opal, this has been replaced by Peak Train Load Estimates from 2017 using the Rail Opal Assignment Model (ROAM).

    The Peak Train Load Estimates use data extracted from the Rail Opal Assignment Model (ROAM). This model assigns Opal journeys to services based on the rail daily working timetable and train punctuality data. The customer load is aggregated to train services and assigned to rail lines.

    The Train Station Entries and Exits dataset is based on the average of a three-day sample, representing 'a typical day' of customer entries and exits at each train station.

  10. Medication table mappings.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Apr 18, 2024
    Cite
    Roger Ward; Christine Mary Hallinan; David Ormiston-Smith; Christine Chidgey; Dougie Boyle (2024). Medication table mappings. [Dataset]. http://doi.org/10.1371/journal.pone.0301557.t005
    Explore at:
    xls
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Roger Ward; Christine Mary Hallinan; David Ormiston-Smith; Christine Chidgey; Dougie Boyle
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The use of routinely collected health data for secondary research purposes is increasingly recognised as a methodology that advances medical research, improves patient outcomes, and guides policy. This secondary data, as found in electronic medical records (EMRs), can be optimised through conversion into a uniform data structure to enable analysis alongside other comparable health metric datasets. This can be achieved with the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM), which employs a standardised vocabulary to facilitate systematic analysis across various observational databases. The concept behind the OMOP-CDM is the conversion of data into a common format through the harmonisation of terminologies, vocabularies, and coding schemes within a unique repository. The OMOP model enhances research capacity through the development of shared analytic and prediction techniques; pharmacovigilance for the active surveillance of drug safety; and ‘validation’ analyses across multiple institutions across Australia, the United States, Europe, and the Asia Pacific. In this research, we aim to investigate the use of the open-source OMOP-CDM in the PATRON primary care data repository.

    Methods: We used standard structured query language (SQL) to construct, extract, transform, and load scripts to convert the data to the OMOP-CDM. The process of mapping distinct free-text terms extracted from various EMRs presented a substantial challenge, as many terms could not be automatically matched to standard vocabularies through direct text comparison. This resulted in a number of terms that required manual assignment. To address this issue, we implemented a strategy where our clinical mappers were instructed to focus only on terms that appeared with sufficient frequency. We established a specific threshold value for each domain, ensuring that more than 95% of all records were linked to an approved vocabulary like SNOMED once appropriate mapping was completed. To assess the data quality of the resultant OMOP dataset we utilised the OHDSI Data Quality Dashboard (DQD) to evaluate the plausibility, conformity, and comprehensiveness of the data in the PATRON repository according to the Kahn framework.

    Results: Across three primary care EMR systems we converted data on 2.03 million active patients to version 5.4 of the OMOP common data model. The DQD assessment involved a total of 3,570 individual evaluations. Each evaluation compared the outcome against a predefined threshold. A ’FAIL’ occurred when the percentage of non-compliant rows exceeded the specified threshold value. In this assessment of the primary care OMOP database described here, we achieved an overall pass rate of 97%.

    Conclusion: The OMOP CDM’s widespread international use, support, and training provides a well-established pathway for data standardisation in collaborative research. Its compatibility allows the sharing of analysis packages across local and international research groups, which facilitates rapid and reproducible data comparisons. A suite of open-source tools, including the OHDSI Data Quality Dashboard (Version 1.4.1), supports the model. Its simplicity and standards-based approach facilitates adoption and integration into existing data processes.

  11. EMR tables and related tables in the OMOP CDM.

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Apr 18, 2024
    Cite
    Roger Ward; Christine Mary Hallinan; David Ormiston-Smith; Christine Chidgey; Dougie Boyle (2024). EMR tables and related tables in the OMOP CDM. [Dataset]. http://doi.org/10.1371/journal.pone.0301557.t004
    Explore at:
    xls
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Roger Ward; Christine Mary Hallinan; David Ormiston-Smith; Christine Chidgey; Dougie Boyle
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The use of routinely collected health data for secondary research purposes is increasingly recognised as a methodology that advances medical research, improves patient outcomes, and guides policy. This secondary data, as found in electronic medical records (EMRs), can be optimised through conversion into a uniform data structure to enable analysis alongside other comparable health metric datasets. This can be achieved with the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM), which employs a standardised vocabulary to facilitate systematic analysis across various observational databases. The concept behind the OMOP-CDM is the conversion of data into a common format through the harmonisation of terminologies, vocabularies, and coding schemes within a unique repository. The OMOP model enhances research capacity through the development of shared analytic and prediction techniques; pharmacovigilance for the active surveillance of drug safety; and ‘validation’ analyses across multiple institutions across Australia, the United States, Europe, and the Asia Pacific. In this research, we aim to investigate the use of the open-source OMOP-CDM in the PATRON primary care data repository.

    Methods: We used standard structured query language (SQL) to construct, extract, transform, and load scripts to convert the data to the OMOP-CDM. The process of mapping distinct free-text terms extracted from various EMRs presented a substantial challenge, as many terms could not be automatically matched to standard vocabularies through direct text comparison. This resulted in a number of terms that required manual assignment. To address this issue, we implemented a strategy where our clinical mappers were instructed to focus only on terms that appeared with sufficient frequency. We established a specific threshold value for each domain, ensuring that more than 95% of all records were linked to an approved vocabulary like SNOMED once appropriate mapping was completed. To assess the data quality of the resultant OMOP dataset we utilised the OHDSI Data Quality Dashboard (DQD) to evaluate the plausibility, conformity, and comprehensiveness of the data in the PATRON repository according to the Kahn framework.

    Results: Across three primary care EMR systems we converted data on 2.03 million active patients to version 5.4 of the OMOP common data model. The DQD assessment involved a total of 3,570 individual evaluations. Each evaluation compared the outcome against a predefined threshold. A ’FAIL’ occurred when the percentage of non-compliant rows exceeded the specified threshold value. In this assessment of the primary care OMOP database described here, we achieved an overall pass rate of 97%.

    Conclusion: The OMOP CDM’s widespread international use, support, and training provides a well-established pathway for data standardisation in collaborative research. Its compatibility allows the sharing of analysis packages across local and international research groups, which facilitates rapid and reproducible data comparisons. A suite of open-source tools, including the OHDSI Data Quality Dashboard (Version 1.4.1), supports the model. Its simplicity and standards-based approach facilitates adoption and integration into existing data processes.
