License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This is a copy of the original Boston Housing Dataset. As of December 2021, the original link doesn't contain the dataset so I'm uploading it if anyone wants to use it. I'll implement a linear regression model to predict the output 'MEDV' variable using PyTorch (check the companion notebook).
I took the data given in this link and processed it to include the column names as well.
https://www.kaggle.com/prasadperera/the-boston-housing-dataset/data
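A minimal sketch of the PyTorch linear-regression idea mentioned above; the file name HousingData.csv is a placeholder for this upload's CSV (which already includes column names), and the training loop is only illustrative:

```python
import pandas as pd
import torch

# Hedged sketch: "HousingData.csv" is a placeholder name for this upload's CSV.
df = pd.read_csv("HousingData.csv").dropna()

X = torch.tensor(df.drop(columns=["MEDV"]).values, dtype=torch.float32)
y = torch.tensor(df["MEDV"].values, dtype=torch.float32).unsqueeze(1)

# Standardize features so plain SGD behaves well.
X = (X - X.mean(dim=0)) / X.std(dim=0)

model = torch.nn.Linear(X.shape[1], 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for epoch in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(f"final training MSE: {loss.item():.2f}")
```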
Good luck on your data science career :)
i. .\File_Mapping.csv: This file relates historical reconstructed hydrology streamflow from the U.S. Army Corps of Engineers (2020) to the appropriate stochastic streamflow file for disaggregation of streamflow. Column A is an assigned ID, column B is named “Stochastic” and is the stochastic streamflow file needed for disaggregation, column C is called “RH_Ratio_Col” and is the name of the column in the reconstructed hydrology dataset associated with a stochastic streamflow file, and column D is named “Col_Num” and is the column number in the reconstructed hydrology dataset with the name given in column C.
ii. .\Original_Draw_YearDat.csv: This file contains the historical year from 1930 to 2017 with the closest total streamflow for the Souris River Basin to each year in the stochastic streamflow dataset. Column A is an index number, column B is named “V1” and is the year in a simulation, column C is called “V2” and is the stochastic simulation number, column D is an integer that can be related to historical years by adding 1929, and column E is named “year” and is the historical year with the closest total Souris River Basin streamflow volume to the associated year in the stochastic traces (see the sketch after this list).
iii. .\revdrawyr.csv: This file is set up the same way as .\Original_Draw_YearDat.csv except that, when a year had over 400 occurrences, it was randomly replaced with one of the 20 other closest years. The replacement process was repeated until there were fewer than 400 occurrences of each reconstructed hydrology year associated with stochastic simulation years. Column A is an index number, column B is named “V1” and is the year in a simulation, column C is called “V2” and is the stochastic simulation number, column D is called “V3” and is the historical year whose streamflow ratios will be multiplied by stochastic streamflow, and column E is called “Stoch_yr” and is the total of 2999 and the year in column B.
iv. .\RH_1930_2017.csv: This file contains the daily streamflow from the U.S. Army Corps of Engineers (2020) reconstructed hydrology for the Souris River Basin for the period of 1930 to 2017. Column A is the date and columns B through AA are the daily streamflow in cubic feet per second.
v. .\rhmoflow_1930Present.csv: This file was created based on .\RH_1930_2017.csv and provides streamflow for each site in cubic meters for a given month. Column A is an unnamed index column, column B is the historical year, column C is the historical month associated with the historical year, column D provides a day equal to 1 but does not have particular significance, and columns E through AD are monthly streamflow volume for each site location.
vi. .\Stoch_Annual_TotVol_CubicDecameters.csv: This file contains the total volume of streamflow for each of the 26 sites for each month in the stochastic streamflow time series and provides a total streamflow volume divided by 100,000 on a monthly basis for the entire Souris River Basin. Column A is unnamed and contains an index number, column B is the month and is named “V1”, column C is the year in a simulation, column D is the simulation number, columns E through AD (V4 through V29) are streamflow volume in cubic meters, and column AE (V30) is total Souris River Basin monthly streamflow volume in cubic decameters/1,000.
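As a hedged illustration of the year mapping described in item ii, the sketch below loads Original_Draw_YearDat.csv with pandas, recovers the historical year by adding 1929 to the integer column, and tallies how often each year was drawn (the revised file revdrawyr.csv caps these counts below 400). Column positions are assumed from the descriptions above, and the file is assumed to have a header row.

```python
import pandas as pd

# Hedged sketch: column positions follow the descriptions above (A=index,
# B=V1, C=V2, D=integer year offset) and may need adjusting.
draws = pd.read_csv("Original_Draw_YearDat.csv")

# The integer in column D maps to a historical year by adding 1929.
draws["hist_year"] = draws.iloc[:, 3] + 1929

# Count how often each historical year was drawn; revdrawyr.csv was built so
# that no year exceeds 400 occurrences.
counts = draws["hist_year"].value_counts()
print(counts[counts > 400])
```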
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset of vertical temperature and salinity profiles obtained at various locations across the Hornsund fjord. Several CTD instruments have been used for data collection: a Valeport miniCTD, two separate SAIV A/S 208 STD/CTDs and two separate RBR concerto CTDs. The data are stored in folders organized by the year (YYYY) of measurements. Each vertical profile is stored as an individual, tab-separated ASCII file. The filenames are formed from the date (and time) of measurement followed by the instrument and station names: YYYYMMDD_instrument_station.txt or YYYYMMDDhhmmss_instrument_station.txt. Each file includes eight header lines with information on station name, geographical location (decimal degrees), bottom depth at the location (m), date (and time) of measurement (YYYY-MM-DDThh:mm:ss), instrument and its serial number, source of financial support and data column names. There are seven data columns: pressure (dbar), depth (m), temperature (°C), potential temperature (°C), practical salinity (PSU), SigmaT density (kg/m**3) and sound velocity (m/s). The data are averaged to 1-dbar vertical bins. Before averaging, the data are visually inspected and suspicious data are removed. Based on inter-calibration between the instruments, a linear correction has been calculated for temperature and conductivity and added to the measurements by the SAIV A/S 208 CTDs. In general, both down- and up-profiles are used for averaging. Finally, the data are interpolated and smoothed.
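A minimal reading sketch, assuming the layout described above (eight header lines followed by seven tab-separated columns); the example filename and the short column labels are hypothetical, and the number of skipped lines may need adjusting if the column-name line is not counted among the eight header lines.

```python
import pandas as pd

# Hedged sketch: reads one Hornsund CTD profile. The path below is a
# hypothetical example of the YYYYMMDD_instrument_station.txt pattern.
path = "2019/20190715_RBR_H1.txt"

cols = ["pressure_dbar", "depth_m", "temp_C", "pot_temp_C",
        "salinity_PSU", "sigmaT_kg_m3", "sound_vel_m_s"]

# Skip the eight header lines and read the seven tab-separated data columns.
profile = pd.read_csv(path, sep="\t", skiprows=8, names=cols)
print(profile.head())
```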
Dataset of vertical temperature, turbidity and dissolved oxygen profiles obtained from Revvatnet, a lake close to Hornsund fjord. The measurements are made with a SAIV A/S 208 STD/CTD (until 2023) and two separate RBR concerto CTDs (since 2024). The data are stored in folders organized by the year (YYYY) of measurements. Each vertical profile is stored as an individual, tab-separated ASCII file. The filenames are formed from the date and time of measurement followed by the instrument, potential additional sensors and station names: YYYYMMDDhhmmss_instrument-sensors_station.txt. Each file includes eight header lines with information on station name, geographical location (UTM), date and time of measurement (YYYY/MM/DD hh:mm), instrument and its serial number, source of financial support and data column names. The data columns include pressure (dbar), temperature (°C), turbidity (FTU/NTU), dissolved oxygen saturation (%) and dissolved oxygen concentration (mg/l). Measurements by RBR concerto CTDs have additional columns for chlorophyll a fluorescence (μg/l) and Photosynthetically Active Radiation (PAR, μmol/m^2/s). Note that this is a raw dataset without quality control.
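A hedged sketch for indexing these raw profiles by parsing the filename pattern described above; the root folder name is a placeholder, and station names are assumed not to contain underscores.

```python
from pathlib import Path
from datetime import datetime

# Hedged sketch: walk the per-year folders and parse each filename of the form
# YYYYMMDDhhmmss_instrument-sensors_station.txt into its parts.
root = Path("Revvatnet_profiles")   # placeholder for the archive's root folder
for txt in sorted(root.glob("*/*.txt")):        # one sub-folder per year
    stamp, instrument, station = txt.stem.split("_", 2)
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    print(when.isoformat(), instrument, station)
```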
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents a long-term monitoring record of phytoplankton (2010-2022) and zooplankton (2010-2023) taxonomic groups, alongside associated environmental parameters (surface and bottom temperature and salinity measurements) from the Iroise Marine Natural Park, France's first marine protected area. The dataset integrates traditional microscopy-based phytoplankton counts with zooplankton imaging data obtained using the ZooScan (Gorsky et al., 2010), as well as zooplankton biovolume and concentration data.
Sampling was conducted seasonally along two main coastal-offshore transects (B and D) and at three coastal stations (Molène, Sein, and Douarnenez), capturing the spatial and temporal dynamics of plankton communities in this unique ecosystem located at the intersection of the English Channel and the Atlantic Ocean. The region is characterized by the seasonal Ushant thermal front, which creates diverse habitats supporting rich plankton communities.
Phytoplankton identification was performed consistently by the same taxonomist throughout the study period, resulting in a high-resolution dataset with 573 distinct taxa across the 785 phytoplankton samples. Zooplankton samples (total number of samples = 650) were digitized using the ZooScan imaging system (Gorsky et al., 2010), with organisms automatically sorted using the built-in semi-automatic algorithms (random forest and convolutional neural networks) of the EcoTaxa platform (Picheral et al., 2017). Expert taxonomists then reviewed and validated the classifications, resulting in 103 taxonomic and morphological groups. Individual zooplankton images are accessible through the EcoTaxa web platform for further morphometric analyses.
Bibliography
Gorsky, G., Ohman, M.D., Picheral, M., Gasparini, S., Stemmann, L., Romagnan, J.-B., Cawood, A., Pesant, S., Garcia-Comas, C., Prejger, F., 2010. Digital zooplankton image analysis using the ZooScan integrated system. J. Plankton Res. 32, 285–303. https://doi.org/10.1093/plankt/fbp124
Picheral, M., Colin, S., Irisson, J.-O., 2017. EcoTaxa, a tool for the taxonomic classification of images.
WoRMS Editorial Board, 2025. World Register of Marine Species. https://doi.org/10.14284/170
Dataset content
The dataset contains three distinct tables, all containing both text and numerical data.
The first table integrates zooplankton measurements with their corresponding environmental parameters and is organised as follows (see also units_pnmi_data_paper.csv):
Metadata information (columns 1-8): station name (column 1), transect name (column 2), coordinates: longitude and latitude (columns 3-4, in dd.dddd), sampling time: date, year, month, and Julian day (columns 5-8).
Environmental measurements: surface and bottom temperature (columns 9-10, in °C), surface and bottom salinity (columns 11-12, in PSU).
Biological data for each taxonomic group: sample abundance in individuals/m³ (columns 13-116, prefix "conc_" + taxa name), total biovolume in mm³/m³ (columns 117-220, prefix "tot_biov_" + taxa name), mean individual biovolume in mm³ (columns 221-324, prefix "mean_biov_" + taxa name); a loading sketch follows this description.
The second table contains phytoplankton data and follows a similar organizational structure:
Metadata information (columns 1-8): station name (column 1), transect name (column 2), coordinates: longitude and latitude (columns 3-4, in dd.dddd), sampling time: date, year, month, and Julian day (columns 5-8).
Environmental measurements: surface and bottom temperature (columns 9-10, in °C), surface and bottom salinity (columns 11-12, in PSU).
Phytoplankton taxa concentrations: surface abundance in individuals/L (columns 13-580, prefix "surface_" + taxa name), bottom abundance in individuals/L (columns 581-1148, prefix "bottom_" + taxa name).
Each taxon is provided in the third table with its corresponding unique identifier, the AphiaID from the World Register of Marine Species (WoRMS Editorial Board, 2025), which enables unambiguous species identification across databases.
For the transect stations (D1 through D6 and B1 through B7), phytoplankton was initially sampled at sub-surface and bottom depths before 2017 (see Table 2). Following the introduction of CTD profiling in 2017, vertical profiles from 2017-2018 revealed that at offshore stations (B5-B7 and D5-D6) the chlorophyll a maximum, when present, consistently occurred between 15-18 m depth. At coastal stations (up to 40 m deep), strong vertical mixing typically maintained a homogeneous water column with no deep chlorophyll maximum, though when present it also occurred at approximately 15 m depth. Based on these observations, bottom sampling was discontinued in 2019 and replaced with sampling at 15 m depth to better capture phytoplankton biomass.
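A hedged loading sketch for the first table; the file name below is a placeholder, while the prefix-based column selection follows the structure described above.

```python
import pandas as pd

# Hedged sketch: "zooplankton_table.csv" is a placeholder for the first table.
zoo = pd.read_csv("zooplankton_table.csv")

# Abundance columns (individuals/m3) are prefixed "conc_", biovolume columns
# "tot_biov_", so taxa can be pulled out by prefix rather than by position.
abundance = zoo.filter(regex=r"^conc_")
total_biovolume = zoo.filter(regex=r"^tot_biov_")

# Example: mean abundance per taxon across all samples.
print(abundance.mean().sort_values(ascending=False).head())
```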
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1,144,803 compounds with 10,915,362 bioactivities on 5,613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, enabling straightforward use in multiple applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules than the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may help with data selection and further accurate curation.
Structure and content of the dataset
| ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV file and a compressed CSV file.
Except for the canonical SMILES columns, all columns are of datatype ‘string’; the canonical SMILES columns use the SMILES format. We recommend the File Reader node for using the dataset in KNIME. With this node, the data types of the columns can be adjusted exactly; in addition, only this node can read the compressed format.
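Outside of KNIME, the exported CSV can also be read with pandas. The sketch below is hedged: it keeps every column as a string as described above, while the file name and the exact header labels (taken from the table header) are assumptions.

```python
import pandas as pd

# Hedged sketch: "consensus_dataset.csv" is a placeholder for the exported file;
# keep every column as a string, mirroring the typing described above.
df = pd.read_csv("consensus_dataset.csv", dtype=str)

# Example: entries that carry both a ChEMBL ID and a canonical SMILES from ChEMBL
# (header labels assumed to match the table above).
subset = df[df["ChEMBL ID"].notna() & df["Canonical SMILES C"].notna()]
print(f"{len(subset)} of {len(df)} rows have a ChEMBL ID and a ChEMBL SMILES")
```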
Column content:
The data lists trees maintained by the City of Edinburgh Council. The data set breaks down into the following fields:
Column A - Primary Key
Column B - Location or Tag no.
Column C - Ward
Column D - Site
Column E - Latin name
Column F - Common Name
Column G - Owner
Column H - NT ref
Column I - Height
Column J - Spread
Column K - Age group
Column L - DBH
The data is updated on a regular basis; please contact the Open Data team if you are looking for the most up-to-date version.
Additional metadata:
- Licence: http://creativecommons.org/licenses/by-nc/2.0/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Related article: Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39.
In this dataset:
We present temporally dynamic population distribution data from the Helsinki Metropolitan Area, Finland, at the level of 250 m by 250 m statistical grid cells. Three hourly population distribution datasets are provided for regular workdays (Mon – Thu), Saturdays and Sundays. The data are based on aggregated mobile phone data collected by the biggest mobile network operator in Finland. Mobile phone data are assigned to statistical grid cells using an advanced dasymetric interpolation method based on ancillary data about land cover, buildings and a time use survey. The data were validated against population register data from Statistics Finland for night-time hours and a daytime workplace registry. The resulting 24-hour population data can be used to reveal the temporal dynamics of the city and examine population variations relevant to, for instance, spatial accessibility analyses, crisis management and planning.
Please cite this dataset as:
Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39. https://doi.org/10.1038/s41597-021-01113-4
Organization of data
The dataset is packaged into a single zip file, Helsinki_dynpop_matrix.zip, which contains the following files:
HMA_Dynamic_population_24H_workdays.csv represents the dynamic population for an average workday in the study area.
HMA_Dynamic_population_24H_sat.csv represents the dynamic population for an average Saturday in the study area.
HMA_Dynamic_population_24H_sun.csv represents the dynamic population for an average Sunday in the study area.
target_zones_grid250m_EPSG3067.geojson represents the statistical grid in ETRS89/ETRS-TM35FIN projection that can be used to visualize the data on a map using e.g. QGIS.
Column names
YKR_ID : a unique identifier for each statistical grid cell (n=13,231). The identifier is compatible with the statistical YKR grid cell data by Statistics Finland and Finnish Environment Institute.
H0, H1 ... H23 : Each field represents the proportional distribution of the total population in the study area between grid cells during a one-hour period. In total, 24 fields are formatted as “Hx”, where x stands for the hour of the day (values ranging from 0-23). For example, H0 stands for the first hour of the day: 00:00 - 00:59. The sum of all cell values for each field equals 100 (i.e. 100% of the total population for each one-hour period).
In order to visualize the data on a map, the result tables can be joined with the target_zones_grid250m_EPSG3067.geojson data. The data can be joined by using the field YKR_ID as a common key between the datasets.
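A hedged sketch of that join using geopandas, with the file names listed above; note that the hourly values are percentages of the total population, so each column sums to roughly 100.

```python
import geopandas as gpd
import pandas as pd

# Hedged sketch: join the workday table to the statistical grid on YKR_ID and
# map the share of population present in each cell at 08:00-08:59 (column "H8").
grid = gpd.read_file("target_zones_grid250m_EPSG3067.geojson")
workdays = pd.read_csv("HMA_Dynamic_population_24H_workdays.csv")

joined = grid.merge(workdays, on="YKR_ID")

print(joined["H8"].sum())            # should be approximately 100
joined.plot(column="H8", legend=True)
```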
License: Creative Commons Attribution 4.0 International.
Related datasets
Järv, Olle; Tenkanen, Henrikki & Toivonen, Tuuli. (2017). Multi-temporal function-based dasymetric interpolation tool for mobile phone data. Zenodo. https://doi.org/10.5281/zenodo.252612
Tenkanen, Henrikki, & Toivonen, Tuuli. (2019). Helsinki Region Travel Time Matrix [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3247564
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Detailed Description of the Dataset:
The dataset, saved as sign_data.csv, is designed for hand sign recognition and contains comprehensive data captured from hand gestures using real-time video processing. Below is a detailed description of the dataset:
sign_data.csv
Tools Used:
- Mediapipe: For detecting hand landmarks and estimating their positions.
- OpenCV: For capturing video frames from a camera.
Functionality:
- Gesture Data Capture: The capture_gesture_data function records hand gestures by processing video frames in real-time. It captures data for a predefined number of rows per gesture, with distances calculated between all pairs of 21 detected hand landmarks.
- Distance Calculation: For each frame, the Euclidean distance between every pair of landmarks is computed, resulting in a comprehensive feature vector for each gesture.
Columns:
- Distance Columns: Each distance column represents the calculated distance between a pair of hand landmarks. With 21 landmarks, there are a total of 210 unique distances (computed as 21 × 20 / 2); a short sketch of this computation appears further below.
- Gesture Label: The final column in the dataset specifies the hand sign label associated with each row of distance measurements (e.g., A, B, C, ..., Z, Space).
Example:
- Column Headers: Distance_0, Distance_1, ..., Distance_209, Sign
- Rows: Each row contains the computed distances followed by the corresponding gesture label.
Gestures Included:
- Alphabet: Signs for letters A-Z.
- Space: Represents the space gesture.
Number of Samples: Data is collected for each gesture with 100 samples per sign.
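A hedged sketch of the pairwise-distance feature construction referenced above; the landmark ordering and the use of the z coordinate are assumptions, since the dataset itself stores only the resulting distances.

```python
import itertools
import math

# Hedged sketch: given the 21 (x, y, z) hand landmarks Mediapipe returns per
# frame, compute the 210 unique pairwise Euclidean distances that correspond
# to Distance_0 ... Distance_209.
def landmark_distances(landmarks):
    """landmarks: list of 21 (x, y, z) tuples for one detected hand."""
    feats = []
    for (x1, y1, z1), (x2, y2, z2) in itertools.combinations(landmarks, 2):
        feats.append(math.dist((x1, y1, z1), (x2, y2, z2)))
    return feats  # len(feats) == 210

# Example with dummy landmarks (real values come from Mediapipe's hand module).
dummy = [(i * 0.01, i * 0.02, 0.0) for i in range(21)]
print(len(landmark_distances(dummy)))  # 210
```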
The dataset provides detailed spatial information about hand gestures, enabling the training and evaluation of hand sign recognition models. By offering a rich set of distance measurements between hand landmarks, it supports the development of accurate and reliable sign language recognition systems. This dataset is crucial for machine learning applications that aim to bridge communication gaps for individuals with hearing or speech impairments.
This dataset is currently associated with an article that is in the process of being published. Once the publication process is completed, a reference link will be added separately. Until that time, this dataset cannot be used for any academic purposes.
The dataset contains 196,926 images and 10 CSV files.
The images are derived from the Image Matching Challenge PhotoTourism 2020 dataset:
https://www.cs.ubc.ca/~kmyi/imw2020/data.html
The CSV files were produced by our work to show a comprehensive comparison of well-known conventional feature extractors/descriptors, including SIFT, SURF, BRIEF, ORB, BRISK, KAZE, AKAZE, FREAK, DAISY, FAST, and STAR.
For Gaussian blur only, there is an additional file.
The images folder contains the images utilized for this study and the derived ones originating from them (196,926 images in total).
To use any results, data, or code from this study, please cite: ISIK M. 2024. Comprehensive empirical evaluation of feature extractors in computer vision. PeerJ Computer Science 10:e2415 https://doi.org/10.7717/peerj-cs.2415
THE COLUMN NAMES:
img-1 and img-2 stand for the compared image names
KP stands for keypoints
goodMatches_normal stands for the matching count with the Brute Force Matcher
GM stands for percentage
goodMatches_knn stands for the matching count with the kNN Matcher
img-1-D-time shows the duration of keypoint extraction for img-1
img-2-D-time shows the duration of keypoint extraction for img-2 (the compared one)
img-1-C-time shows the duration of comparing keypoints for img-1
img-2-C-time shows the duration of comparing keypoints for img-2 (the compared one)
total-D-time is the total of img-1-D-time and img-2-D-time
total-C-time is the total of img-1-C-time and img-2-C-time
matcher-time_normal stands for the duration of the matching process with the Brute Force Matcher
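As a hedged illustration of how timing and matching columns of this kind can be produced (not the authors' exact code; the file names and the choice of ORB here are placeholders), an OpenCV sketch:

```python
import time
import cv2

# Hedged sketch: extract ORB keypoints/descriptors for two images, time the
# extraction (the *-D-time columns) and the brute-force matching step
# (matcher-time_normal). File names are placeholders.
img1 = cv2.imread("img-1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img-2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()

t0 = time.perf_counter()
kp1, des1 = orb.detectAndCompute(img1, None)
d_time_1 = time.perf_counter() - t0          # img-1-D-time

t0 = time.perf_counter()
kp2, des2 = orb.detectAndCompute(img2, None)
d_time_2 = time.perf_counter() - t0          # img-2-D-time

bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
t0 = time.perf_counter()
matches = bf.match(des1, des2)
matcher_time_normal = time.perf_counter() - t0

print(len(kp1), len(kp2), len(matches), d_time_1 + d_time_2)  # KP counts, matches, total-D-time
```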
More explanation will be added here soon.
File A is a big data file. File B is a file with already registered users; files C and D are opt-out files. The goal is to delete everybody from File A who has opted out or is already registered.
So, from file A we remove (automatically) all lines that contain email addresses that are present in files B, C or D.
After this, we will change the column names a bit to fit the required format, and we are done.
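A hedged pandas sketch of that filtering step, assuming all four files are CSVs with an "email" column; the actual file and column names are not specified above, and the target column names in the rename step are placeholders.

```python
import pandas as pd

# Hedged sketch: file names and the "email" column name are assumptions.
a = pd.read_csv("file_A.csv")
exclude = pd.concat([
    pd.read_csv("file_B.csv")["email"],   # already registered users
    pd.read_csv("file_C.csv")["email"],   # opt-out list 1
    pd.read_csv("file_D.csv")["email"],   # opt-out list 2
]).str.lower().unique()

# Keep only rows of A whose email does not appear in B, C or D.
cleaned = a[~a["email"].str.lower().isin(exclude)]

# Rename columns to the required format (target names are placeholders).
cleaned = cleaned.rename(columns={"email": "Email", "name": "FullName"})
cleaned.to_csv("file_A_cleaned.csv", index=False)
```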
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset containing information about passengers aboard the Titanic is one of the most famous datasets used in data science and machine learning. It was created to analyze and understand the factors that influenced survival rates among passengers during the tragic sinking of the RMS Titanic on April 15, 1912.
The dataset is often used for predictive modeling and statistical analysis to determine which factors (such as socio-economic status, age, gender, etc.) were associated with a higher likelihood of survival. It contains 1309 rows and 14 columns.
Pclass: Ticket class indicating the socio-economic status of the passenger. It is categorized into three classes: 1 = Upper, 2 = Middle, 3 = Lower.
Survived: A binary indicator that shows whether the passenger survived (1) or not (0) during the Titanic disaster. This is the target variable for analysis.
Name: The full name of the passenger, including title (e.g., Mr., Mrs., etc.).
Sex: The gender of the passenger, denoted as either male or female.
Age: The age of the passenger in years.
SibSp: The number of siblings or spouses aboard the Titanic for the respective passenger.
Parch: The number of parents or children aboard the Titanic for the respective passenger.
Ticket: The ticket number assigned to the passenger.
Fare: The fare paid by the passenger for the ticket.
Cabin: The cabin number assigned to the passenger, if available.
Embarked: The port of embarkation for the passenger. It can take one of three values: C = Cherbourg, Q = Queenstown, S = Southampton.
Boat: If the passenger survived, this column contains the identifier of the lifeboat they were rescued in.
Body: If the passenger did not survive, this column contains the identification number of their recovered body, if applicable.
Home.dest: The destination or place of residence of the passenger.
These descriptions provide a detailed understanding of each column in the Titanic dataset subset, offering insights into the demographic, travel, and survival-related information recorded for each passenger.
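A minimal exploration sketch, assuming the upload's CSV is named titanic.csv and uses the column names described above (the actual file name and header casing may differ):

```python
import pandas as pd

# Hedged sketch: "titanic.csv" is a placeholder for this upload's file.
titanic = pd.read_csv("titanic.csv")

# Survival rate by ticket class and sex, using the Survived target column.
print(titanic.groupby(["Pclass", "Sex"])["Survived"].mean().round(2))
```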