19 datasets found

m
Network traffic and code for machine learning classification
data.mendeley.com
Updated Feb 20, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
Explore at:
Unique identifier
https://doi.org/10.17632/5pmnkshffm.2
Dataset updated
Feb 20, 2020
Authors
Víctor Labayen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in google docs and remote CLI's sessions by SSH. Bulk data transfer: applications that perform a transfer of large data volume files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository among others. Web browsing: contains all the generated traffic while searching and consuming different web pages. Examples of those pages are several blogs and new sites and the moodle of the university. Vídeo playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The most known server used are Twitch and Youtube but the university online classroom has also been used. Idle behaviour: is composed by the background traffic generated by the user computer when the user is idle. This traffic has been captured with every application closed and with some opened pages like google docs, YouTube and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.

The amount of data is stated as follows:

Bulk : 19 traces, 3599 s of total duration, 8704 MBytes of pcap files Video : 23 traces, 4496 s, 1405 MBytes Web : 23 traces, 4203 s, 148 MBytes Interactive : 42 traces, 8934 s, 30.5 MBytes Idle : 52 traces, 6341 s, 0.69 MBytes

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.
UMAHand: Hand Activity Dataset (Universidad de Málaga)
figshare.com
portaldelainvestigacion.uma.es
zip
Updated Jul 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eduardo Casilari; Jennifer Barbosa-Galeano; Francisco Javier González-Cañete (2024). UMAHand: Hand Activity Dataset (Universidad de Málaga) [Dataset]. http://doi.org/10.6084/m9.figshare.25638246.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25638246.v3
Dataset updated
Jul 2, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
Eduardo Casilari; Jennifer Barbosa-Galeano; Francisco Javier González-Cañete
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The objective of the UMAHand dataset is to provide a systematic, Internet-accessible benchmarking database for evaluating algorithms for the automatic identification of manual activities. The database was created by monitoring 29 predefined activities involving specific movements of the dominant hand. These activities were performed by 25 participants, each completing a certain number of repetitions. During each movement, participants wore a 'mote' or Shimmer sensor device on their dominant hand's wrist. This sensor, comparable in weight and volume to a wristwatch, was attached with an elastic band according to a predetermined orientation.The Shimmer device contains an Inertial Measurement Unit (IMU) with a triaxial accelerometer, gyroscope, magnetometer, and barometer. These sensors recorded measurements of acceleration, angular velocity, magnetic field, and atmospheric pressure at a constant sampling frequency of 100 Hz during each movement.The UMAHand Dataset comprises a main directory and three subdirectories: TRACES (containing measurements), VIDEOS (containing video sequences) and SCRIPTS (with two scripts that automate the downloading, unzipping and processing of the dataset). The main directory also includes three descriptive plain text files and an image:• "readme.txt": this file is a brief guide of the dataset which describes the basic characteristics of the database, the testbed or experimental framework used to generate it and the organization of the data file.• "user_characteristics.txt": which contains a line of six numerical (comma-separated) values for each participant describing their personal characteristics in the following order: 1) an abstract user identifier (a number from 01 to 25), 2) a binary value indicating whether the participant is left-handed (0) or right-handed (1), 3) a numerical value indicating gender: male (0), female (1), undefined or undisclosed (2), 4) the weight in kg, and 5) the height in cm and 6) the age in years.• "activity_description.txt": For each activity, this text file incorporates a line with the activity identifier (numbered from 01 to 29) and an alphanumeric string that briefly describes the performed action.• "sensor_orientation.jpg": a JPEG-type image file illustrating the way the sensor is carried and the orientation of the measurement axes.The TRACE subfolder with the data is, in turn, organized into 25 secondary subfolders, one for each participant, named with the word "output" followed by underscore symbol (_) and the corresponding participant identifier (a number from 1 to 25). Each subdirectory contains one CSV (Comma Separated Values) file for each trial (each repetition of any activity) performed by the corresponding volunteer.The filenames with the monitored data follow the following format: "user_XX_activity_YY_trial_ZZ.csv" where XX, YY, and ZZ represent the identifiers of the participant (XX), the activity (YY) and the repetition number (ZZ), respectively.In the files, which do not include any header, each line corresponds to a sample taken by the sensing node. Thus, each line of the CSV files presents a set of the simultaneous measurements captured by the sensors of the Shimmer mote at a certain instant. The values in each line are arranged as follows:•Timestamp, Ax, Ay, Az, Gx, Gy, Gz, Mx, My, Mz, Pwhere:-Timestamp is the time indication of the moment when the following measurements were taken. Time is measured in milliseconds elapsed since the start of the recording. Therefore, the first sample, in the first line of the file, has a zero value while the rest of the timestamps in the file are relative to this first sample.-Ax, Ay, Az are the measurements of the three axes of the triaxial accelerometer (in g units).-Gx, Gy, Gz indicate the components of the angular velocity measured by the triaxial gyroscope (in degrees per second or dps).-Mx, My, Mz represent the 3-axis data in microteslas (µT) captured by the magnetometer.-P is the measurement of pressure in millibars.Besides, the VIDEOS directory includes 29 anonymized video clips that illustrate with the corresponding examples the 29 manual activities carried out by the participants. The video files are encoded in MPEG4 format and named according to the format "Example_Activity_XX.mp4", where XX indicates the identifier of the movement (as described in the activity_description.txt file).Finally, the SCRIPTS subfolder comprises two scripts written in Python and Matlab. These two programs (named Load_traces), which perform the same function, are designed to automate the downloading and processing of the data. Specifically, these scripts perform the following tasks:1. Download the database from the public repository as a single compressed zip file.2. Unzip the aforementioned file and create the subfolder structure of the dataset in a specific directory named UMAHand_Dataset. As previously commented, in the subfolder named TRACES, one CSV trace file per each experiment (i.e. per each movement, user, and trial) is created.3. Read all the CSV files and store their information in a list of dictionaries (Python) or a matrix of structures (Matlab) named datasetTraces. Each element in that list/matrix has two fields: the filename (which identifies the user, the type of performed activity, and the trial number) and a numerical array of 11 columns containing the timestamps and the measurements of the sensors for that experiment (arranged as mentioned above).All experiments and data acquisition were conducted in private home environments. Participants were asked to perform those activities involving sustained or continuous hand movements (e.g. clapping hands) for at least 10 seconds. In the case of brief and punctual movements, which might require less than 10 seconds (e.g. picking up an object from the floor), volunteers were simply asked to execute the action until its conclusion. Thus, a total of 752 samples were collected, with durations ranging from 1.98 to 119.98 seconds.
n
Repository Analytics and Metrics Portal (RAMP) 2020 data
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 23, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2020 data [Dataset]. http://doi.org/10.5061/dryad.dv41ns1z4
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.dv41ns1z4
Dataset updated
Jul 23, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Version update: The originally uploaded versions of the CSV files in this dataset included an extra column, "Unnamed: 0," which is not RAMP data and was an artifact of the process used to export the data to CSV format. This column has been removed from the revised dataset. The data are otherwise the same as in the first version.

The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2020. For a description of the data collection, processing, and output methods, please see the "methods" section below.

Methods Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Following data processing describe below, on ingest into RAMP a additional field, citableContent, is added to the page level data.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for combination of country and device, with one row per country/device combination:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes." Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.

As a result, two CSV datasets are provided for each month of published data:

page-clicks:

The data in these CSV files correspond to the page-level data, and include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search. citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No. index: The Elasticsearch index corresponding to page click data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “page-clicks”. For example, the file named 2020-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2020.

country-device-info:

The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search. index: The Elasticsearch index corresponding to country and device access information data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “country-device-info”. For example, the file named 2020-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2020.

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
n
Repository Analytics and Metrics Portal (RAMP) 2019 data
data.niaid.nih.gov
search.dataone.org
+2more
zip
Updated Jul 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2019 data [Dataset]. http://doi.org/10.5061/dryad.crjdfn342
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.crjdfn342
Dataset updated
Jul 14, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Version update: The originally uploaded versions of the CSV files in this dataset included an extra column, "Unnamed: 0," which is not RAMP data and was an artifact of the process used to export the data to CSV format. This column has been removed from the revised dataset. The data are otherwise the same as in the first version.

The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2019. For a description of the data collection, processing, and output methods, please see the "methods" section below.

Methods

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Following data processing describe below, on ingest into RAMP a additional field, citableContent, is added to the page level data.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for combination of country and device, with one row per country/device combination:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes." Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.

As a result, two CSV datasets are provided for each month of published data:

page-clicks:

The data in these CSV files correspond to the page-level data, and include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search. citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No. index: The Elasticsearch index corresponding to page click data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “page-clicks”. For example, the file named 2019-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2019.

country-device-info:

The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search. index: The Elasticsearch index corresponding to country and device access information data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “country-device-info”. For example, the file named 2019-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2019.

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
n
Data from: Repository Analytics and Metrics Portal (RAMP) 2021 data
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Wheeler; Kenning Arlitsch (2023). Repository Analytics and Metrics Portal (RAMP) 2021 data [Dataset]. http://doi.org/10.5061/dryad.1rn8pk0tz
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.1rn8pk0tz
Dataset updated
May 23, 2023
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2021. For a description of the data collection, processing, and output methods, please see the "methods" section below.

The record will be revised periodically to make new data available through the remainder of 2021.

Methods

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Following data processing describe below, on ingest into RAMP a additional field, citableContent, is added to the page level data.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for combination of country and device, with one row per country/device combination:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes." Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.

As a result, two CSV datasets are provided for each month of published data:

page-clicks:

The data in these CSV files correspond to the page-level data, and include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search. citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No. index: The Elasticsearch index corresponding to page click data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “page-clicks”. For example, the file named 2021-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2021.

country-device-info:

The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search. index: The Elasticsearch index corresponding to country and device access information data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “country-device-info”. For example, the file named 2021-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2021.

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Walker Fall Detection Dataset
kaggle.com
Updated Sep 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
agarciag (2023). Walker Fall Detection Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/6493939
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/6493939
Dataset updated
Sep 18, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
agarciag
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

The Walker Fall Detection Data Set is a curated compilation of inertial data designed for the study of fall detection systems, specifically for people using walking assistance. This data set offers deep insight into various movement patterns. It covers data from four different classes: idle, motion, step and fall.

This dataset was published as part of a research paper: Dataset and System Design for Orthopedic Walker Fall Detection and Activity Logging Using Motion Classification

Data Acquisition

Data was recorded using an IMU affixed to a walker, as illustrated in the image below:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15943143%2F134aaf57e284d20775d37ecdfa14d30e%2F20230712_163957.jpg?generation=1695049359163422&alt=media" alt="Walker">

The IMU used for this project is the Arduino Nano 33 BLE Sense. It's powered by a LiPo battery and is equipped with a voltage regulator and a dedicated battery charging circuit. To ensure durability and protection during the data recording phase, the entire prototype was securely housed in a custom 3D-printed casing.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15943143%2F0c8458d7bd161e983e1676e72d89085d%2F20230712_163957_2.jpg?generation=1695050869941377&alt=media" alt="">

The prototype was designed to transmit data wirelessly to a computer using Bluetooth Low Energy (BLE). Upon receipt, a Python script processed the incoming data and stored it in JSON format. The data transmission rate was optimized to achieve the highest possible rate, resulting in approximately 100 samples per second, covering both accelerometer and gyroscope data.

Data were collected from four different subjects, each of whom maneuvered the walker down a hallway, primarily capturing step and movement data. It is important to note that the “**idle**” data are not subject-specific, as it represents periods in which the walker is stationary. Similarly, “**fall**” data is also not linked to any particular individual; was obtained by deliberately pushing the walker from a vertical position to the ground.

Data Processing

This dataset contains four classes:

Idle: Represents periods of no movement, indicating that the walker is stationary.

Step: Capture moments when an individual takes a step using the walker.

Movement: Covers any movement that is not considered a step or a fall, such as when carrying the walker or adjusting its position.

Fall: Denotes cases in which the walker tips over and falls to the ground from an upright position.

To effectively categorize the data, several processing steps were executed. Initially, the data was reduced from its original 100 samples per second to ensure a constant time step between samples, since the original rate was not uniformly constant. After this, both the acceleration and gyro data were normalized to one sample every 12.5 milliseconds, resulting in a rate of 80 samples per second. This normalization allowed the synchronization of acceleration and gyroscope data, which were subsequently stored in dictionaries in JSON format. Each dictionary contains the six dimensions (three acceleration and three gyro) corresponding to a specific timestamp.

To distinguish individual samples within each group, the root mean square (RMS) value of the six dimensions (comprising acceleration and gyroscope data) was calculated. Subsequently, an algorithm based on the hidden Markov model (HMM) was used to discern the hidden states inherent in the data, which facilitated the segmentation of the data set.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F15943143%2F250182bc7b0c3ca0383ea79ac6e3224a%2Fhmm.jpg?generation=1695059033833835&alt=media" alt="">

Through the filtering process, the HMM effectively identifies individual steps. Once all steps were identified, the window size was determined based on the duration of each step. A window size of 160 samples was chosen, which, given a rate of 80 samples per second, is equivalent to a duration of 2 seconds for each sample.

A similar procedure is employed to extract "fall" samples. However, for "idle" and "motion" samples, isolation isn't necessary. Instead, samples from these categories can be arbitrarily chosen from the recorded clusters.

Final Dataset The finalized dataset is presented in CSV format. The first column serves as the label column and covers all four classes. In addition to this, the CSV file has 960 columns of functions. These columns encapsulate 160 samples each of acceleration and gyro data in the x, y, and z axes.

Each class contains 620 samples, bringing the overall total to 2480 samples across all classes.

Citation and Use

This dataset is associated with a research article currently un...
f
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
figshare
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data and have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables conveydemographics (281 variables),dietary consumption (324 variables),physiological functions (1,040 variables),occupation (61 variables),questionnaires (1444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),medications (29 variables),mortality information linked from the National Death Index (15 variables),survey weights (857 variables),environmental exposure biomarker measurements (598 variables), andchemical comments indicating which measurements are below or above the lower limit of detection (505 variables).csv Data Record: The curated NHANES datasets and the data dictionaries includes 23 .csv files and 1 excel file.The curated NHANES datasets involves 20 .csv formatted files, two for each module with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments."dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES."dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.“dictionary_drug_codes.csv” contains the dictionary for descriptors on the drugs codes.“nhanes_inconsistencies_documentation.xlsx” is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file.“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.“example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.“example_3 - run_multiple_regressions.Rmd” demonstrates how run multiple regression models with and without adjusting for the sampling design.
SUBF v1.0 Dataset: Bearing Fault Vibration Data
kaggle.com
zip
Updated Nov 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumair Aziz (2024). SUBF v1.0 Dataset: Bearing Fault Vibration Data [Dataset]. https://www.kaggle.com/datasets/sumairaziz/subf-v1-0-dataset-bearing-fault-vibration-data
Explore at:
zip(1707586602 bytes)Available download formats
Dataset updated
Nov 19, 2024
Authors
Sumair Aziz
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
**SUBF Dataset v1.0: Bearing Fault Diagnosis using Vibration Signals **

Description The SUBF dataset v1.0 has been designed for the analysis and diagnosis of mechanical bearing faults. The mechanical setup consists of a motor, a frame/base, bearings, and a shaft, simulating different machine conditions such as a healthy state, inner race fault, and outer race fault. This dataset aims to facilitate reproducibility and support research in mechanical fault diagnosis and machine condition monitoring.

The dataset is part of the research paper "Aziz, S., Khan, M. U., Faraz, M., & Montes, G. A. (2023). Intelligent bearing faults diagnosis featuring automated relative energy-based empirical mode decomposition and novel cepstral autoregressive features. Measurement, 216, 112871." DOI: https://doi.org/10.1016/j.measurement.2023.112871

The dataset can be used with MATLAB and Python.

Experimental Setup Motor: A 3-phase AC motor, 0.25 HP, operating at 1440 RPM, 50 Hz frequency, and 440 Volts. Target Bearings: The left-side bearing was replaced to represent three categories: - Normal Bearings - Inner Race Fault Bearings - Outer Race Fault Bearings

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7973470%2Fd6325d79c6be2747f8c2368d28b4000f%2Fs1.png?generation=1732113324463780&alt=media" alt="">

Instrumentation - Sensor: BeanDevice 2.4 GHz AX-3D, a wireless vibration sensor, was used to record vibration data. - Recording: Data collected via BeanGateway and stored on a PC. - Sampling: 1000 Hz.

Data Acquisition - Duration: 18 hours of data collection (6 hours per class). - Segmenting: Signals were divided into 10-second segments, resulting in 2160 signals for each fault category. - Classes: Healthy state, inner race fault, and outer race fault.

Dataset Organization The dataset is structured as follows: Main Folder: Contains two subfolders for .mat and .csv file formats to accommodate different user preferences.

Subfolder 1: .mat Files Healthy: Contains .mat files representing vibration signals for the healthy state. Inner Race Fault: Contains .mat files representing vibration signals for bearings with an inner race fault. Outer Race Fault: Contains .mat files representing vibration signals for bearings with an outer race fault.

Subfolder 2: .csv Files Healthy: Contains .csv files representing vibration signals for the healthy state. Inner Race Fault: Contains .csv files representing vibration signals for bearings with an inner race fault. Outer Race Fault: Contains .csv files representing vibration signals for bearings with an outer race fault.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7973470%2F59b468d1431202f361679b0a99d328da%2Fs3.png?generation=1732113352733560&alt=media" alt="">

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7973470%2F34db849fa30587a9f37491cade3258ac%2Fs2.png?generation=1732113382823548&alt=media" alt="">

Applications This dataset is suitable for tasks such as: Fault detection and diagnosis Signal processing and feature extraction research Development and benchmarking of machine learning and deep learning models

Usage This dataset can be used for academic research, industrial fault diagnosis applications, and algorithm development. Please cite the following reference when using this dataset: Aziz, S., Khan, M. U., Faraz, M., & Montes, G. A. (2023). Intelligent bearing faults diagnosis featuring automated relative energy based empirical mode decomposition and novel cepstral autoregressive features. Measurement, 216, 112871." DOI: https://doi.org/10.1016/j.measurement.2023.112871

Licence This dataset is made publicly available for research purposes. Ensure appropriate citation and credit when using the data.https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7973470%2F1a76f7de5a0ca312ddc2d9ed0caf99a5%2Fs2.png?generation=1732113357938012&alt=media" alt="">
n
Repository Analytics and Metrics Portal (RAMP) 2018 data
data.niaid.nih.gov
dataone.org
+2more
zip
Updated Jul 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2018 data [Dataset]. http://doi.org/10.5061/dryad.ffbg79cvp
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.ffbg79cvp
Dataset updated
Jul 27, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "methods" section below. Note that the RAMP data model changed in August, 2018 and two sets of documentation are provided to describe data collection and processing before and after the change.

Methods

RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. country: The country from which the corresponding search originated. device: The device used for the search. date: The date of the search.

Following data processing describe below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes." Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. country: The country from which the corresponding search originated. device: The device used for the search. date: The date of the search. citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No. index: The Elasticsearch index corresponding to page click data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.

Data Collection from August 19, 2018 Onward

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Following data processing describe below, on ingest into RAMP a additional field, citableContent, is added to the page level data.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for combination of country and device, with one row per country/device combination:

country: The country from which the corresponding search originated. device: The device used for the search. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository
u
Dr. Duke's Phytochemical and Ethnobotanical Databases
agdatacommons.nal.usda.gov
catalog.data.gov
zip
Updated Nov 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
James A. Duke (2023). Dr. Duke's Phytochemical and Ethnobotanical Databases [Dataset]. http://doi.org/10.15482/USDA.ADC/1239279
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.15482/USDA.ADC/1239279
Dataset updated
Nov 30, 2023
Dataset provided by
Ag Data Commons
Authors
James A. Duke
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Of interest to pharmaceutical, nutritional, and biomedical researchers, as well as individuals and companies involved with alternative therapies and and herbal products, this database is one of the world's leading repositories of ethnobotanical data, evolving out of the extensive compilations by the former Chief of USDA's Economic Botany Laboratory in the Agricultural Research Service in Beltsville, Maryland, in particular his popular Handbook of phytochemical constituents of GRAS herbs and other economic plants (CRC Press, Boca Raton, FL, 1992). In addition to Duke's own publications, the database documents phytochemical information and quantitative data collected over many years through research results presented at meetings and symposia, and findings from the published scientific literature. The current Phytochemical and Ethnobotanical databases facilitate plant, chemical, bioactivity, and ethnobotany searches. A large number of plants and their chemical profiles are covered, and data are structured to support browsing and searching in several user-focused ways. For example, users can

get a list of chemicals and activities for a specific plant of interest, using either its scientific or common name download a list of chemicals and their known activities in PDF or spreadsheet form find plants with chemicals known for a specific biological activity display a list of chemicals with their LD toxicity data find plants with potential cancer-preventing activity display a list of plants for a given ethnobotanical use find out which plants have the highest levels of a specific chemical

References to the supporting scientific publications are provided for each specific result. Resources in this dataset:Resource Title: Duke-Source-CSV.zip. File Name: Duke-Source-CSV.zipResource Description: Dr. Duke's Phytochemistry and Ethnobotany - raw database tables for archival purposes. Visit https://phytochem.nal.usda.gov/phytochem/search for the interactive web version of the database.Resource Title: Data Dictionary (preliminary). File Name: DrDukesDatabaseDataDictionary-prelim.csvResource Description: This Data Dictionary describes the columns for each table. [Note that this is in progress and some variables are yet to be defined or are unused in the current implementation. Please send comments/suggestions to nal-adc-curator@ars.usda.gov ]
Customer Shopping Trends Dataset
kaggle.com
Updated Oct 5, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 5, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sourav Banerjee
Description
Context

The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

Content

This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

Dataset Glossary (Column-wise)

Customer ID - Unique identifier for each customer

Age - Age of the customer

Gender - Gender of the customer (Male/Female)

Item Purchased - The item purchased by the customer

Category - Category of the item purchased

Purchase Amount (USD) - The amount of the purchase in USD

Location - Location where the purchase was made

Size - Size of the purchased item

Color - Color of the purchased item

Season - Season during which the purchase was made

Review Rating - Rating given by the customer for the purchased item

Subscription Status - Indicates if the customer has a subscription (Yes/No)

Shipping Type - Type of shipping chosen by the customer

Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)

Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)

Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction

Payment Method - Customer's most preferred payment method

Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)

Structure of the Dataset

https://i.imgur.com/6UEqejq.png" alt="">

Acknowledgement

This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

Cover Photo by: Freepik

Thumbnail by: Clothing icons created by Flat Icons - Flaticon
n
Repository Analytics and Metrics Portal (RAMP) 2017 data
data.niaid.nih.gov
datadryad.org
zip
Updated Jul 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2017 data [Dataset]. http://doi.org/10.5061/dryad.r7sqv9scf
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.r7sqv9scf
Dataset updated
Jul 27, 2021
Dataset provided by
University of New Mexico
Montana State University
Authors
Jonathan Wheeler; Kenning Arlitsch
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance use data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "methods" section below.

Methods RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. country: The country from which the corresponding search originated. device: The device used for the search. date: The date of the search.

Following data processing describe below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

Filter data to only include rows where "citableContent" is set to "Yes." Sum the value of the "clicks" field on these rows.

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property. impressions: The number of times the URL appears within the SERP. clicks: The number of clicks on a URL which took users to a page outside of the SERP. clickThrough: Calculated as the number of clicks divided by the number of impressions. position: The position of the URL within the SERP. country: The country from which the corresponding search originated. device: The device used for the search. date: The date of the search. citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No. index: The Elasticsearch index corresponding to page click data for a single IR. repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
d
National Monuments Service - Archaeological Survey of Ireland
datasalsa.com
data.gov.ie
csv, feature service +2
Updated Apr 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Housing, Local Government and Heritage (2024). National Monuments Service - Archaeological Survey of Ireland [Dataset]. https://datasalsa.com/dataset/?catalogue=data.gov.ie&name=national-monuments-service-archaeological-survey-of-ireland
Explore at:
feature service, html, shp, csvAvailable download formats
Dataset updated
Apr 7, 2024
Dataset authored and provided by
Department of Housing, Local Government and Heritage
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 14, 2025
Area covered
Ireland
Description
National Monuments Service - Archaeological Survey of Ireland. Published by Department of Housing, Local Government and Heritage. Available under the license Creative Commons Attribution 4.0 (CC-BY-4.0).This Archaeological Survey of Ireland dataset is published from the database of the National Monuments Service Sites and Monuments Record (SMR). This dataset also can be viewed and interrogated through the online Historic Environment Viewer: https://heritagedata.maps.arcgis.com/apps/webappviewer/index.html?id=0c9eb9575b544081b0d296436d8f60f8

A Sites and Monuments Record (SMR) was issued for all counties in the State between 1984 and 1992. The SMR is a manual containing a numbered list of certain and possible monuments accompanied by 6-inch Ordnance Survey maps (at a reduced scale). The SMR formed the basis for issuing the Record of Monuments and Places (RMP) - the statutory list of recorded monuments established under Section 12 of the National Monuments (Amendment) Act 1994. The RMP was issued for each county between 1995 and 1998 in a similar format to the existing SMR. The RMP differs from the earlier lists in that, as defined in the Act, only monuments with known locations or places where there are believed to be monuments are included.

The large Archaeological Survey of Ireland archive and supporting database are managed by the National Monuments Service and the records are continually updated and supplemented as additional monuments are discovered. On the Historic Environment viewer an area around each monument has been shaded, the scale of which varies with the class of monument. This area does not define the extent of the monument, nor does it define a buffer area beyond which ground disturbance should not take place – it merely identifies an area of land within which it is expected that the monument will be located. It is not a constraint area for screening – such must be set by the relevant authority who requires screening for their own purposes. This data has been released for download as Open Data under the DPER Open Data Strategy and is licensed for re-use under the Creative Commons Attribution 4.0 International licence. http://creativecommons.org/licenses/by/4.0

Please note that the centre point of each record is not indicative of the geographic extent of the monument. The existing point centroids were digitised relative to the OSI 6-inch mapping and the move from this older IG-referenced series to the larger-scale ITM mapping will necessitate revisions. The accuracy of the derived ITM co-ordinates is limited to the OS 6-inch scale and errors may ensue should the user apply the co-ordinates to larger scale maps. Records that do not refer to 'monuments' are designated 'Redundant record' and are retained in the archive as they may relate to features that were once considered to be monuments but which on investigation proved otherwise. Redundant records may also refer to duplicate records or errors in the data structure of the Archaeological Survey of Ireland.

This dataset is provided for re-use in a number of ways and the technical options are outlined below. For a live and current view of the data, please use the web services or the data extract tool in the Historic Environment Viewer. The National Monuments Service also provide an Open Data snapshot of its national dataset in CSV as a bulk data download. Users should consult the National Monument Service website https://www.archaeology.ie/ for further information and guidance on the National Monument Act(s) and the legal significance of this dataset.

Open Data Bulk Data Downloads (version date: 23/08/2023)

The Sites and Monuments Record (SMR) is provided as a national download in Comma Separated Value (CSV) format. This format can be easily integrated into a number of software clients for re-use and analysis. The Longitude and Latitude coordinates are also provided to aid its re-use in web mapping systems, however, the ITM easting/northings coordinates should be quoted for official purposes. ERSI Shapefiles of the SMR points and SMRZone polygons are also available The SMRZones represent an area around each monument, the scale of which varies with the class of monument. This area does not define the extent of the monument, nor does it define a buffer area beyond which ground disturbance should not take place – it merely identifies an area of land within which it is expected that the monument will be located. It is not a constraint area for screening – such must be set by the relevant authority who requires screening for their own purposes.

GIS Web Service APIs (live views):

For users with access to GIS software please note that the Archaeological Survey of Ireland data is also available spatial data web services. By accessing and consuming the web service users are deemed to have accepted the Terms and Conditions. The web services are available at the URL endpoints advertised below:

SMR; https://services-eu1.arcgis.com/HyjXgkV6KGMSF3jt/arcgis/rest/services/SMROpenData/FeatureServer

SMRZone; https://services-eu1.arcgis.com/HyjXgkV6KGMSF3jt/arcgis/rest/services/SMRZoneOpenData/FeatureServer

Historic Environment Viewer - Query Tool

The "Query" tool can alternatively be used to selectively filter and download the data represented in the Historic Environment Viewer. The instructions for using this tool in the Historic Environment Viewer are detailed in the associated Help file: https://www.archaeology.ie/sites/default/files/media/pdf/HEV_UserGuide_v01.pdf...
Open Data Portal Catalogue
open.canada.ca
datasets.ai
+1more
csv, json, jsonl, png +2
Updated Jun 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
Explore at:
csv, sqlite, json, png, jsonl, xlsxAvailable download formats
Dataset updated
Jun 14, 2025
Dataset provided by
Treasury Board of Canadahttps://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html
Treasury Board of Canada Secretariathttp://www.tbs-sct.gc.ca/
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link) Resources 2 - 8 are generated using the Flatterer (external link) utility. ###Description of resources: 1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON. 2. Catalogue is a XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata. 3. datasets metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output. 4. Resources Metadata contains the metadata for the resources contained within each dataset. 5. resource views metadata contains the metadata for the views applied to each resource, if a resource has a view configured. 6. datastore fields metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore enabled CSVs. 7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains. 8. data package entity relation diagram Displays the title and format for column, in each table in the Data Package in the form of a ERD Diagram. The Data Package resource offers a text based version. 9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
P
Kitsune Network Attack Dataset Dataset
paperswithcode.com
Updated Jan 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yisroel Mirsky; Tomer Doitshman; Yuval Elovici; Asaf Shabtai (2024). Kitsune Network Attack Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/kitsune-network-attack-dataset
Explore at:
Dataset updated
Jan 7, 2024
Authors
Yisroel Mirsky; Tomer Doitshman; Yuval Elovici; Asaf Shabtai
Description
Kitsune Network Attack Dataset This is a collection of nine network attack datasets captured from a either an IP-based commercial surveillance system or a network full of IoT devices. Each dataset contains millions of network packets and diffrent cyber attack within it.

For each attack, you are supplied with:

A preprocessed dataset in csv format (ready for machine learning) The corresponding label vector in csv format The original network capture in pcap format (in case you want to engineer your own features)

We will now describe in detail what's in these datasets and how they were collected.

The Network Attacks We have collected a wide variety of attacks which you would find in a real network intrusion. The following is a list of the cyber attack datasets avalaible:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F827271%2F79e305668553e521b0709a2413323c45%2Fkaggle_dataset_table.png?generation=1598461684070844&alt=media" alt="image" width="100">

For more details on the attacks themselves, please refer to our NDSS paper (citation below).

The Data Collection The following figure presents the network topologies which we used to collect the data, and the corrisponding attack vectors at which the attacks were performed. The network capture took place at point 1 and point X at the router (where a network intrusion detection system could feasibly be placed). For each dataset, clean network traffic was captured for the first 1 million packets, then the cyber attack was performed.

The Dataset Format Each preprocessed dataset csv has m rows (packets) and 115 columns (features) with no header. The 115 features were extracted using our AfterImage feature extractor, described in our NDSS paper (see below) and available in Python here. In summary, the 115 features provide a statistical snapshot of the network (hosts and behaviors) in the context of the current packet traversing the network. The AfterImage feature extractor is unique in that it can efficiently process millions of streams (network channels) in real-time, incrementally, making it suitable for handling network traffic.

Citation If you use these datasets, please cite:

@inproceedings{mirsky2018kitsune, title={Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection}, author={Mirsky, Yisroel and Doitshman, Tomer and Elovici, Yuval and Shabtai, Asaf}, booktitle={The Network and Distributed System Security Symposium (NDSS) 2018}, year={2018} }
Data for activity recognition
kaggle.com
zip
Updated Mar 23, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oleksandr Kosovan (2020). Data for activity recognition [Dataset]. https://www.kaggle.com/kosovanolexandr/data-for-activity-recognition
Explore at:
zip(4526717 bytes)Available download formats
Dataset updated
Mar 23, 2020
Authors
Oleksandr Kosovan
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

What is my goal?

I want to collect accelerometer data from my smart phone. And after that I want to create model, which will predict class of my activity (for example: running or walking).

How will I do it?

I will collect data with AndroSensor. And I will use this data for modeling.

Screen of app:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1522552%2Fb323cfb996b0ac5111547aca2212e35f%2Fandro-sensor.png?generation=1584971258940394&alt=media" alt="">

What is accelerometer?

An accelerometer is a device that measures proper acceleration. Proper acceleration, being the acceleration (or rate of change of velocity) of a body in its own instantaneous rest frame, is not the same as coordinate acceleration, being the acceleration in a fixed coordinate system. For example, an accelerometer at rest on the surface of the Earth will measure an acceleration due to Earth's gravity, straight upwards (by definition) of g ≈ 9.81 m/s2. By contrast, accelerometers in free fall (falling toward the center of the Earth at a rate of about 9.81 m/s2) will measure zero. More...

Content

I collect data with my smartphone - Samsung Galaxy J6. Every folder represent data that accelerometer returned, when I did something (run, walk etc.). Every csv file is history of my 15 seconds activity with 0.5 seconds frequency. You can see more data and details in my GitHub.

We have few classes of activity: - running - walking - up/down stairs - idle

Inspiration

I hope that this dataset will help with some researches and will inspire someone.
Mental Health Dataset
kaggle.com
Updated Mar 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bhavik Jikadara (2024). Mental Health Dataset [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/mental-health-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 18, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bhavik Jikadara
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.

Benefits of using this dataset:

Insight into Mental Health: The dataset provides valuable insights into mental health by analyzing linguistic patterns, sentiment, and psychological indicators in text data. Researchers and data scientists can gain a better understanding of how mental health issues manifest in online communication.

Predictive Modeling: With a wide range of features, including sentiment analysis scores and psychological indicators, the dataset offers opportunities for developing predictive models to identify or predict mental health outcomes based on textual data. This can be useful for early intervention and support.

Community Engagement: Mental health is a topic of increasing importance, and this dataset can foster community engagement on platforms like Kaggle. Data enthusiasts, researchers, and mental health professionals can collaborate to analyze the data and develop solutions to address mental health challenges.

Data-driven Insights: By analyzing the dataset, users can uncover correlations and patterns between linguistic features, sentiment, and mental health indicators. These insights can inform interventions, policies, and support systems aimed at promoting mental well-being.

Educational Resource: The dataset can serve as a valuable educational resource for teaching and learning about mental health analytics, sentiment analysis, and text mining techniques. It provides a real-world dataset for students and practitioners to apply data science skills in a meaningful context.
SMILES DataSet for Analysis & Prediction Dataset
kaggle.com
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yan Maksi (2023). SMILES DataSet for Analysis & Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/yanmaksi/big-molecules-smiles-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 15, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yan Maksi
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F9770082%2Fb234dd748f233e4d3ef1d72d048828b5%2FMastering%20Drug%20Design.jpg?generation=1686502308761641&alt=media" alt="">

Read this article to get unlock the wonderful world Deep Reinforcement Learning for Drug Design

ReLeaSE is a public dataset, consisting of molecular structures and their corresponding binding affinity to proteins. The dataset was created for the purpose of evaluating and comparing machine learning models for the prediction of protein-ligand binding affinity.

The dataset contains a total of 10,000 molecules and their binding affinity to several target proteins, including thrombin, kinase, and protease. The molecular structures are represented using Simplified Molecular Input Line Entry System (SMILES) notation, which is a standardized method for representing molecular structures as a string of characters. The binding affinity is represented as a negative logarithm of the dissociation constant (pKd), which is a measure of the strength of the interaction between the molecule and the target protein.

The ReLeaSE dataset provides a standardized benchmark for evaluating machine learning models for protein-ligand binding affinity prediction. The dataset is publicly available and can be used for research purposes, making it an important resource for the drug discovery community.
Restaurant Order Details
kaggle.com
Updated Jun 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohamed Harris (2022). Restaurant Order Details [Dataset]. https://www.kaggle.com/datasets/mohamedharris/restaurant-order-details
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 8, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mohamed Harris
Description
This dataset has details of orders placed by customers to the restaurants in a food delivery app. There are 500 orders that were placed on a day.

Join both the tables in this dataset to get the complete data.

You can use dataset to find the patterns in the orders placed by customers. You can analyze this dataset to find the answers to the below questions. 1) Which restaurant received the most orders? 2) Which restaurant saw most sales? 3) Which customer ordered the most? 4) When do customers order more in a day? 5) Which is the most liked cuisine? 6) Which zone has the most sales?

Please upvote if you like my work.

Disclaimer: The names of the customers and restaurants used are only for representational purposes. They do not represent any real life nouns, but are only fictional.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2

Network traffic and code for machine learning classification

Explore at:

3 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.17632/5pmnkshffm.2

Dataset updated

Feb 20, 2020

Authors

Víctor Labayen

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.

Activities:

Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in google docs and remote CLI's sessions by SSH. Bulk data transfer: applications that perform a transfer of large data volume files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository among others. Web browsing: contains all the generated traffic while searching and consuming different web pages. Examples of those pages are several blogs and new sites and the moodle of the university. Vídeo playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The most known server used are Twitch and Youtube but the university online classroom has also been used. Idle behaviour: is composed by the background traffic generated by the user computer when the user is idle. This traffic has been captured with every application closed and with some opened pages like google docs, YouTube and several web pages, but always without user interaction.

The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.

The amount of data is stated as follows:

Bulk : 19 traces, 3599 s of total duration, 8704 MBytes of pcap files Video : 23 traces, 4496 s, 1405 MBytes Web : 23 traces, 4203 s, 148 MBytes Interactive : 42 traces, 8934 s, 30.5 MBytes Idle : 52 traces, 6341 s, 0.69 MBytes

The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

Clear search

Close search

Google apps

Main menu

Network traffic and code for machine learning classification

UMAHand: Hand Activity Dataset (Universidad de Málaga)

Repository Analytics and Metrics Portal (RAMP) 2020 data

Repository Analytics and Metrics Portal (RAMP) 2019 data

Data from: Repository Analytics and Metrics Portal (RAMP) 2021 data

Walker Fall Detection Dataset

Cleaned NHANES 1988-2018

SUBF v1.0 Dataset: Bearing Fault Vibration Data

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7973470%2Fd6325d79c6be2747f8c2368d28b4000f%2Fs1.png?generation=1732113324463780&alt=media" alt="">

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7973470%2F34db849fa30587a9f37491cade3258ac%2Fs2.png?generation=1732113382823548&alt=media" alt="">

Repository Analytics and Metrics Portal (RAMP) 2018 data

Dr. Duke's Phytochemical and Ethnobotanical Databases

Customer Shopping Trends Dataset

Context

Content

Dataset Glossary (Column-wise)

Structure of the Dataset

Acknowledgement

Repository Analytics and Metrics Portal (RAMP) 2017 data

National Monuments Service - Archaeological Survey of Ireland

Open Data Portal Catalogue

Kitsune Network Attack Dataset Dataset

Data for activity recognition

Context

Content

Inspiration

Mental Health Dataset

Benefits of using this dataset:

SMILES DataSet for Analysis & Prediction Dataset

Restaurant Order Details

Network traffic and code for machine learning classificationSee More Versions

Network traffic and code for machine learning classification