Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPHERE (Students' Performance in Physics Education Research) is a multi-domain learning dataset of students' performance in physics, collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students participated, drawn from three large and one small public high school located in a suburban district of a highly populated province in Indonesia. Variables related to demographics, access to literature resources, and students' physics identity were also collected. The RBAs used in this dataset were selected to match the concepts students learn in the Indonesian physics curriculum. We surveyed students' understanding of Newtonian mechanics at the end of the first semester using the Force Concept Inventory (FCI) and the Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed students' scientific abilities and learning attitudes through the Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS), respectively. Conceptual assessments continued in the second semester with the Rotational and Rolling Motion Conceptual Survey (RRMCS), Fluid Mechanics Concept Inventory (FMCI), Mechanical Waves Conceptual Survey (MWCS), Thermal Concept Evaluation (TCE), and Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE to be a valuable dataset for supporting the advancement of the PER field, particularly in quantitative studies. For example, research on machine learning and data mining techniques in PER has been hindered by the lack of datasets compiled specifically for PER studies. SPHERE can be reused as a students' physics performance dataset dedicated to PER scholars who wish to apply machine learning techniques in physics education.
DL3 example dataset from Crab Nebula observations with LST-1
This repository contains a subsample of DL3 files from Crab Nebula observations used in the performance study of the Large-Sized Telescope prototype (LST-1, https://www.lst1.iac.es/) for the Cherenkov Telescope Array Observatory (CTAO, https://www.ctao.org/). The results of this performance study [1] were obtained from a larger sample of the Crab Nebula observations than the one compiled here.
This reduced dataset aims to serve as an example for analyzing data observed with one of the telescopes that will be part of the future CTAO. These files are intended for use in the hands-on sessions for the 1D high-level DL3 analysis at the CTAO School (https://www.school.cta-observatory.org/).
Information about the data and the reduction process
The DL3 files included in this repository are a subsample of 1.9 hours of Crab Nebula observation data taken on March 4th and 5th, 2022. They were produced with cta-lstchain [2] in FITS format following the Gamma-ray Astronomy Data Format (GADF; [3]) and can be directly read and analyzed with Gammapy [4]. Data were processed following the source-independent analysis approach described in [1]. The gamma-hadron separation and directional cuts (gammaness and theta parameters) for the gamma-ray-like event selection were chosen to keep 70% of gamma-ray-like simulated events in each bin of reconstructed energy. The point-like instrument response functions (IRFs) were produced with pyirf [5], applying the same energy-dependent efficiency cuts to simulated gamma rays in an all-sky grid of pointing positions. Final IRFs for each observation run were produced by linear interpolation among the simulated pointing nodes closest to the actual telescope pointing during the Crab Nebula observations.
List of files
dl3_LST-1.Run07253.fits
dl3_LST-1.Run07254.fits
dl3_LST-1.Run07255.fits
dl3_LST-1.Run07256.fits
dl3_LST-1.Run07274.fits
dl3_LST-1.Run07275.fits
dl3_LST-1.Run07276.fits
dl3_LST-1.Run07277.fits
hdu-index.fits.gz
obs-index.fits.gz
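Because the files follow the GADF layout, they can be opened directly with Gammapy via the index files listed above. The following is a minimal sketch, assuming the files sit in the current directory, a Gammapy 1.x installation, and that the observation IDs in the index match the run numbers in the file names; it is an illustration only, not the analysis configuration used in [1].

```python
from gammapy.data import DataStore

# Build a data store from obs-index.fits.gz and hdu-index.fits.gz
# located in the current directory.
data_store = DataStore.from_dir(".")
print(data_store.obs_table)

# Load the eight Crab Nebula runs; these DL3 files ship point-like IRFs,
# so we request those explicitly (Gammapy >= 1.0).
observations = data_store.get_observations(
    [7253, 7254, 7255, 7256, 7274, 7275, 7276, 7277],
    required_irf="point-like",
)

for obs in observations:
    # Each observation exposes the event list and the associated IRFs.
    print(obs.obs_id, len(obs.events.table), "events")
```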
Acknowledgements
The production of these files has been possible thanks to the LST Collaboration work at different levels, namely, hardware and software development, data-taking, production of simulations, and data analysis.
References
[1] H. Abe et al 2023 ApJ 956 80 (DOI 10.3847/1538-4357/ace89d)
[2] cta-lstchain: https://doi.org/10.5281/zenodo.10849683
[3] Data formats for gamma-ray astronomy. https://github.com/open-gamma-ray-astro/gamma-astro-data-formats
[4] A&A, 678, A157 (2023) DOI https://doi.org/10.1051/0004-6361/202346488
[5] pyirf: https://doi.org/10.5281/zenodo.8348922
This dataset contains the results of the Water Network Tool for Resilience (WNTR) case study application on a New York drinking water system. The data include the population impacted by the firefighting and pipe criticality analyses; the water service availability (WSA) and pressure for the loss-of-source-water scenarios; and the modified resilience index and the combined performance index for an example pipe criticality simulation and the loss-of-source-water scenarios. This dataset is associated with the following publication: Chu-Ketterer, L., R. Murray, P. Hassett, J. Kogan, K. Klise, and T. Haxton. Performance and Resilience Analysis of a New York Drinking Water System to Localized and System-Wide Emergencies. Journal of Water Resources Planning and Management. American Society of Civil Engineers (ASCE), Reston, VA, USA, 149(1): 05022015, (2023).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data Description
File: vehicle_data.csv
Columns:
Vehicle_ID: Unique identifier for each vehicle
Engine_Size (liters): Engine displacement
Cylinders: Number of cylinders
Fuel_Type: (Gasoline, Diesel, Hybrid)
City_MPG: Fuel efficiency in city driving (miles per gallon)
Highway_MPG: Fuel efficiency in highway driving (miles per gallon)
CO2_Emissions (grams/mile): Carbon dioxide emissions
Data Collection and Preprocessing
Source: Data collected from resources like https://www.fueleconomy.gov/ and manufacturer websites.
Preprocessing: Missing values were handled using mean/median imputation (depending on data distribution). Categorical features (e.g., Fuel_Type) were one-hot encoded.
Potential Use Cases
Training regression models to predict CO2 emissions based on vehicle characteristics.
Developing classification models to categorize vehicles into emission groups (low, medium, high).
Building fuel consumption prediction models for route optimization and logistics.
Analyzing the relationship between vehicle features and environmental impact.
Dataset Structure (vehicle_data.csv)
Vehicle_ID,Engine_Size,Cylinders,Fuel_Type,City_MPG,Highway_MPG,CO2_Emissions
1,2.0,4,Gasoline,28,36,320
2,3.5,6,Gasoline,20,28,405
3,1.8,4,Hybrid,45,52,210
4,3.0,6,Diesel,22,30,430
...
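To illustrate the preprocessing and the regression use case described above, here is a minimal sketch assuming pandas and scikit-learn are available and that the file is named vehicle_data.csv with the column structure shown; it is an example workflow, not the pipeline used to build the dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset described above.
df = pd.read_csv("vehicle_data.csv")

# Impute missing numeric values with the median (one of the options
# mentioned in the preprocessing notes).
numeric_cols = ["Engine_Size", "Cylinders", "City_MPG", "Highway_MPG", "CO2_Emissions"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# One-hot encode the categorical Fuel_Type column.
df = pd.get_dummies(df, columns=["Fuel_Type"])

# Train a simple regression model to predict CO2 emissions.
X = df.drop(columns=["Vehicle_ID", "CO2_Emissions"])
y = df["CO2_Emissions"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```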
Ethical Considerations
Responsible Use: Promote the development of AI models that support environmentally conscious decision-making in the automotive industry. Bias: Strive to uncover and reduce potential biases present in the data.
Contribution
We welcome contributions to expand and improve this dataset. Please follow these guidelines:
Example Dataset (vehicle_data.csv)
You can find a small-scale example dataset like the one described above on platforms like:
Kaggle: Search for "vehicle emissions datasets" on https://www.kaggle.com/datasets
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Version update: The originally uploaded versions of the CSV files in this dataset included an extra column, "Unnamed: 0," which is not RAMP data and was an artifact of the process used to export the data to CSV format. This column has been removed from the revised dataset. The data are otherwise the same as in the first version.
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data for institutional repositories. The data here are a subset of RAMP data (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2020. For a description of the data collection, processing, and output methods, please see the "methods" section below.
Methods
Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
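As an illustration, here is a minimal sketch of these two steps applied to one month of the published page-clicks CSV data (pandas is assumed; the file name follows the naming convention described later in this record).

```python
import pandas as pd

# Load one month of page-level data (file naming described below).
df = pd.read_csv("2020-01_RAMP_all_page-clicks.csv")

# Step 1: keep only rows that point to citable content.
citable = df[df["citableContent"] == "Yes"]

# Step 2: citable content downloads (CCD) are the sum of clicks on those rows.
ccd = citable["clicks"].sum()
print("CCD for January 2020:", ccd)
```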
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.
As a result, two CSV datasets are provided for each month of published data:
page-clicks:
The data in these CSV files correspond to the page-level data, and include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “page-clicks”. For example, the file named 2020-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2020.
country-device-info:
The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
index: The Elasticsearch index corresponding to country and device access information data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “country-device-info”. For example, the file named 2020-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2020.
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Study information
The sample included in this dataset represents five children who participated in a number line intervention study. Originally six children were included in the study, but one of them met the exclusion criterion after missing several consecutive sessions, so their data are not included in the dataset. All participants were attending Year 1 of primary school at an independent school in New South Wales, Australia. To be eligible to participate, children had to present with low mathematics achievement, performing at or below the 25th percentile on the Maths Problem Solving and/or Numerical Operations subtests from the Wechsler Individual Achievement Test III (WIAT III A & NZ, Wechsler, 2016). Children were excluded from participating if, as reported by their parents, they had any other diagnosed disorder such as attention deficit hyperactivity disorder, autism spectrum disorder, intellectual disability, developmental language disorder, cerebral palsy, or uncorrected sensory disorders. The study followed a multiple baseline case series design, with a baseline phase, a treatment phase, and a post-treatment phase. The baseline phase varied between two and three measurement points, the treatment phase varied between four and seven measurement points, and all participants had one post-treatment measurement point. The number of measurement points was distributed across participants as follows:
Participant 1 – 3 baseline, 6 treatment, 1 post-treatment
Participant 3 – 2 baseline, 7 treatment, 1 post-treatment
Participant 5 – 2 baseline, 5 treatment, 1 post-treatment
Participant 6 – 3 baseline, 4 treatment, 1 post-treatment
Participant 7 – 2 baseline, 5 treatment, 1 post-treatment
In each session across all three phases, children were assessed on their performance on a number line estimation task, a single-digit computation task, a multi-digit computation task, a dot comparison task, and a number comparison task. Furthermore, during the treatment phase, all children completed the intervention task after these assessments. The order of the assessment tasks varied randomly between sessions.
Measures
Number Line Estimation. Children completed a computerised bounded number line task (0-100). The number line is presented in the middle of the screen, and the target number is presented above the start point of the number line to avoid signalling the midpoint (Dackermann et al., 2018). Target numbers included two non-overlapping sets (trained and untrained) of 30 items each. Untrained items were assessed in all phases of the study. Trained items were assessed independently of the intervention during the baseline and post-treatment phases, and performance on the intervention is used to index performance on the trained set during the treatment phase. Within each set, numbers were equally distributed throughout the number range, with three items within each ten (0-10, 11-20, 21-30, etc.). Target numbers were presented in random order. Participants did not receive performance-based feedback. Accuracy is indexed by percent absolute error (PAE): [(estimated number − target number) / scale of the number line] × 100.
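A minimal sketch of the PAE calculation as described above (taking the absolute difference, consistent with "absolute error"; the function name is illustrative only):

```python
def percent_absolute_error(estimate: float, target: float, scale: float = 100.0) -> float:
    """Percent absolute error (PAE) for a bounded number line of the given scale."""
    return abs(estimate - target) / scale * 100.0

# Example: estimating 42 for a target of 47 on a 0-100 line gives a PAE of 5.0.
print(percent_absolute_error(42, 47))
```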
Single-Digit Computation. The task included ten additions with single-digit addends (1-9) and single-digit results (2-9). The order was counterbalanced so that half of the additions present the lowest addend first (e.g., 3 + 5) and half of the additions present the highest addend first (e.g., 6 + 3). This task also included ten subtractions with single-digit minuends (3-9), subtrahends (1-6) and differences (1-6). The items were presented horizontally on the screen accompanied by a sound and participants were required to give a verbal response. Participants did not receive performance-based feedback. Performance on this task was indexed by item-based accuracy.
Multi-digit computational estimation. The task included eight additions and eight subtractions presented with double-digit numbers and three response options. None of the response options represent the correct result. Participants were asked to select the option that was closest to the correct result. In half of the items the calculation involved two double-digit numbers, and in the other half one double and one single digit number. The distance between the correct response option and the exact result of the calculation was two for half of the trials and three for the other half. The calculation was presented vertically on the screen with the three options shown below. The calculations remained on the screen until participants responded by clicking on one of the options on the screen. Participants did not receive performance-based feedback. Performance on this task is measured by item-based accuracy.
Dot Comparison and Number Comparison. Both tasks included the same 20 items, which were presented twice, counterbalancing left and right presentation. Magnitudes to be compared were between 5 and 99, with four items for each of the following ratios: .91, .83, .77, .71, .67. Both quantities were presented horizontally side by side, and participants were instructed to press one of two keys (F or J), as quickly as possible, to indicate the largest one. Items were presented in random order and participants did not receive performance-based feedback. In the non-symbolic comparison task (dot comparison) the two sets of dots remained on the screen for a maximum of two seconds (to prevent counting). Overall area and convex hull for both sets of dots is kept constant following Guillaume et al. (2020). In the symbolic comparison task (Arabic numbers), the numbers remained on the screen until a response was given. Performance on both tasks was indexed by accuracy.
The Number Line Intervention
During the intervention sessions, participants estimated the position of 30 Arabic numbers on a 0-100 bounded number line. As a form of feedback, within each item, the participant's estimate remained visible, and the correct position of the target number appeared on the number line. When the estimate's PAE was lower than 2.5, a message appeared on the screen that read “Excellent job”; when PAE was between 2.5 and 5, the message read “Well done, so close!”; and when PAE was higher than 5, the message read “Good try!” Numbers were presented in random order.
Variables in the dataset
Age = age in ‘years, months’ at the start of the study
Sex = female/male/non-binary or third gender/prefer not to say (as reported by parents)
Math_Problem_Solving_raw = Raw score on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
Math_Problem_Solving_Percentile = Percentile equivalent on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
Num_Ops_Raw = Raw score on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
Num_Ops_Percentile = Percentile equivalent on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).
The remaining variables refer to participants’ performance on the study tasks. Each variable name is composed of three sections. The first refers to the phase and session: for example, Base1 refers to the first measurement point of the baseline phase, Treat1 to the first measurement point of the treatment phase, and post1 to the first measurement point of the post-treatment phase.
The second part of the variable name refers to the task, as follows:
DC = dot comparison
SDC = single-digit computation
NLE_UT = number line estimation (untrained set)
NLE_T = number line estimation (trained set)
CE = multidigit computational estimation
NC = number comparison
The final part of the variable name refers to the type of measure being used (i.e., acc = total correct responses and pae = percent absolute error).
Thus, variable Base2_NC_acc corresponds to accuracy on the number comparison task during the second measurement point of the baseline phase and Treat3_NLE_UT_pae refers to the percent absolute error on the untrained set of the number line task during the third session of the Treatment phase.
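As a small illustration of this naming convention, here is a sketch that splits a variable name into its three parts; the helper function below is hypothetical and simply mirrors the convention described above.

```python
def parse_variable_name(name: str) -> dict:
    """Split a name such as 'Treat3_NLE_UT_pae' into phase/session, task, and measure."""
    phase_session, *rest = name.split("_")
    measure = rest[-1]          # 'acc' or 'pae'
    task = "_".join(rest[:-1])  # e.g. 'NC', 'SDC', 'NLE_UT'
    return {"phase_session": phase_session, "task": task, "measure": measure}

print(parse_variable_name("Base2_NC_acc"))
# {'phase_session': 'Base2', 'task': 'NC', 'measure': 'acc'}
print(parse_variable_name("Treat3_NLE_UT_pae"))
# {'phase_session': 'Treat3', 'task': 'NLE_UT', 'measure': 'pae'}
```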
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.html
This dataset consists of five CSV files that provide detailed data on a stock portfolio and related market performance over the last 5 years. It includes portfolio positions, stock prices, and major U.S. market indices (NASDAQ, S&P 500, and Dow Jones). The data is essential for conducting portfolio analysis, financial modeling, and performance tracking.
This file contains the portfolio composition with details about individual stock positions, including the quantity of shares, sector, and their respective weights in the portfolio. The data also includes the stock's closing price.
Ticker: The stock symbol (e.g., AAPL, TSLA)
Quantity: The number of shares in the portfolio
Sector: The sector the stock belongs to (e.g., Technology, Healthcare)
Close: The closing price of the stock
Weight: The weight of the stock in the portfolio (as a percentage of the total portfolio)
This file contains historical pricing data for the stocks in the portfolio. It includes daily open, high, low, and close prices, adjusted close prices, returns, and the volume of shares traded.
Date: The date of the data point
Ticker: The stock symbol
Open: The opening price of the stock on that day
High: The highest price reached on that day
Low: The lowest price reached on that day
Close: The closing price of the stock
Adjusted: The adjusted closing price after stock splits and dividends
Returns: Daily percentage return based on close prices
Volume: The volume of shares traded that day
This file contains historical pricing data for the NASDAQ Composite index, providing the same fields as the portfolio prices file, but for the NASDAQ market index.
Date: The date of the data point
Ticker: The stock symbol (for the NASDAQ index, this will be "IXIC")
Open: The opening price of the index
High: The highest value reached on that day
Low: The lowest value reached on that day
Close: The closing value of the index
Adjusted: The adjusted closing value after any corporate actions
Returns: Daily percentage return based on close values
Volume: The volume of shares traded
This file contains similar historical pricing data, but for the S&P 500 index, providing insights into the performance of the top 500 U.S. companies.
Date: The date of the data point
Ticker: The stock symbol (for the S&P 500 index, this will be "SPX")
Open: The opening price of the index
High: The highest value reached on that day
Low: The lowest value reached on that day
Close: The closing value of the index
Adjusted: The adjusted closing value after any corporate actions
Returns: Daily percentage return based on close values
Volume: The volume of shares traded
This file contains similar historical pricing data for the Dow Jones Industrial Average, providing insights into one of the most widely followed stock market indices in the world.
Date: The date of the data point
Ticker: The stock symbol (for the Dow Jones index, this will be "DJI")
Open: The opening price of the index
High: The highest value reached on that day
Low: The lowest value reached on that day
Close: The closing value of the index
Adjusted: The adjusted closing value after any corporate actions
Returns: Daily percentage return based on close values
Volume: The volume of shares traded
This data is retrieved using a custom framework that fetches real-time and historical stock data from Yahoo Finance. It provides the portfolio’s data based on user-specific stock holdings and performance, allowing for personalized analysis. This personal framework ensures the portfolio data is automatically retrieved and updated with the latest stock prices, returns, and performance metrics.
This part of the dataset would typically involve data specific to a particular user’s stock positions, weights, and performance, which can be integrated with the other files for portfolio performance analysis.
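As an illustration of how the files can be combined, here is a minimal sketch of computing weighted daily portfolio returns. pandas is assumed; the file names positions.csv and portfolio_prices.csv are hypothetical (the actual CSV file names are not listed in this description), and the Returns column is assumed to be expressed as a decimal fraction.

```python
import pandas as pd

# Hypothetical file names; the actual CSV names are not given in the description.
positions = pd.read_csv("positions.csv")                       # Ticker, Quantity, Sector, Close, Weight
prices = pd.read_csv("portfolio_prices.csv", parse_dates=["Date"])  # Date, Ticker, ..., Returns, Volume

# Map each ticker to its portfolio weight (given as a percentage).
weights = positions.set_index("Ticker")["Weight"] / 100.0

# Weighted daily return of the portfolio: sum of weight * return per date.
prices["weighted_return"] = prices["Ticker"].map(weights) * prices["Returns"]
portfolio_returns = prices.groupby("Date")["weighted_return"].sum()

# Cumulative performance over the covered window.
cumulative = (1 + portfolio_returns).cumprod() - 1
print(cumulative.tail())
```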
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data for institutional repositories. The data here are a subset of RAMP data (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2021. For a description of the data collection, processing, and output methods, please see the "methods" section below.
The record will be revised periodically to make new data available through the remainder of 2021.
Methods
Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.
As a result, two CSV datasets are provided for each month of published data:
page-clicks:
The data in these CSV files correspond to the page-level data, and include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “page-clicks”. For example, the file named 2021-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2021.
country-device-info:
The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
index: The Elasticsearch index corresponding to country and device access information data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “country-device-info”. For example, the file named 2021-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2021.
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
GUI-based software coded in Python to support high-throughput image processing and analytics of large satellite imagery datasets and to provide spatiotemporal monitoring of crop health conditions throughout the growing season by automatically generating 1) a field map calendar (FMC) with daily thumbnails of vegetation heatmaps for each month and 2) a seasonal Vegetation Index (VI) profile of the crop fields. Output examples of the FMC and VI profile are provided in the files fmCalendar.jpg and NDVI_Profile.jpg, respectively, which were created from satellite imagery acquired from 5/1 to 10/31, 2020 over a sugarbeet field in Moorhead, MN.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data for institutional repositories. The data here are a subset of RAMP data (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "methods" section below.
Methods
RAMP Data Documentation – January 1, 2017 through August 18, 2018
Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.
The data in these CSV files include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data follow the format 2017-01_RAMP_all.csv. Using this example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2017.
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The study of N-linked glycosylation has long been complicated by a lack of bioinformatics tools. In particular, there is still a lack of fast and robust data processing tools for targeted (relative) quantitation. We have developed modular, high-throughput data processing software, MassyTools, that is capable of calibrating spectra, extracting data, and performing quality control calculations based on a user-defined list of glycan or glycopeptide compositions. Typical examples of output include relative areas after background subtraction, isotopic pattern-based quality scores, spectral quality scores, and signal-to-noise ratios. We demonstrated MassyTools’ performance on MALDI-TOF-MS glycan and glycopeptide data from different samples. MassyTools yielded better calibration than the commercial software flexAnalysis, generally showing 2-fold better ppm errors after internal calibration. Relative quantitation using MassyTools and flexAnalysis gave similar results, yielding a relative standard deviation (RSD) of the main glycan of ∼6%. However, MassyTools yielded 2- to 5-fold lower RSD values for low-abundant analytes than flexAnalysis. Additionally, feature curation based on the computed quality criteria improved the data quality. In conclusion, we show that MassyTools is a robust automated data processing tool for high-throughput, high-performance glycosylation analysis. The package is released under the Apache 2.0 license and is freely available on GitHub (https://github.com/Tarskin/MassyTools).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example dataset containing B-cell receptor (BCR) gene sequences. This dataset is intended to be used for testing software tools developed to annotate (i.e. map Variable, Diversity and Joining segments) and perform clonal analysis of BCR sequencing data.
Sequencing:
Libraries were prepared using 5'RACE from PBMCs of a healthy donor. Input molecules were tagged with unique molecular identifiers (UMIs). Sequencing was run on a MiSeq with 300+300 bp reads.
Contents:
The dataset contains both raw sequencing reads and high-quality consensus sequences assembled using the unique molecular identifier (UMI) tagging approach. Consensus assembly corrects for sequencing errors and eliminates sequencing artifacts.
All files contain a UMI tag sequence in their header, in the form UMI:NNNN:QQQQ, where N is a base character and Q is a quality character (for assembled consensuses, the total number of reads is given instead of the Q string).
Note that consensus sequences were assembled using only raw sequences corresponding to UMI tags supported by at least 10 sequencing reads. This means the consensus sequence files contain a subset of all UMI tags found in the raw sequences. Thus, to assess software performance on raw sequencing reads using the assembled consensus sequences as a high-quality standard, the raw sequencing reads should first be filtered to contain only those UMI tags that are present in the consensus sequence file.
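A minimal sketch of such a filtering step, assuming FASTQ-formatted raw reads and FASTA-formatted consensus sequences with the UMI tag embedded in the header as described; the file names are hypothetical and the exact header layout may differ.

```python
import re

UMI_PATTERN = re.compile(r"UMI:([ACGTN]+):")

def extract_umi(header):
    """Pull the UMI base sequence out of a read or consensus header, if present."""
    match = UMI_PATTERN.search(header)
    return match.group(1) if match else None

# Collect UMI tags present in the consensus file (FASTA headers start with '>').
consensus_umis = set()
with open("consensus.fasta") as handle:
    for line in handle:
        if line.startswith(">"):
            umi = extract_umi(line)
            if umi:
                consensus_umis.add(umi)

# Keep only raw reads (FASTQ, 4 lines per record) whose UMI is in the consensus set.
with open("raw_reads.fastq") as src, open("raw_reads.filtered.fastq", "w") as dst:
    while True:
        record = [src.readline() for _ in range(4)]
        if not record[0]:
            break
        if extract_umi(record[0]) in consensus_umis:
            dst.writelines(record)
```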
Citations:
The whole dataset was used to benchmark the MiXCR software and was originally referenced in Bolotin DA, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nature Methods 12(5):380-381, 2015.
Data pre-processing was carried out using the MIGEC software: Shugay M, et al. Towards error-free profiling of immune repertoires. Nature Methods 11(6):653-655, 2014.
Contributors:
The dataset was generated in Prof. Chudakov's lab (Adaptive Immunity Group at Masaryk University, Brno, and Genomics of Adaptive Immunity Lab at the Institute of Bioorganic Chemistry, Moscow). Sample preparation and sequencing were performed by Dr. Olga Britanova and Dr. Maria Turchaninova. Raw sequencing reads were pre-processed and uploaded by Dr. Mikhail Shugay.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data for institutional repositories. The data here are a subset of RAMP data (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "methods" section below. Note that the RAMP data model changed in August 2018, and two sets of documentation are provided to describe data collection and processing before and after the change.
Methods
RAMP Data Documentation – January 1, 2017 through August 18, 2018
Data Collection
RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.
The data in these CSV files include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.
Data Collection from August 19, 2018 Onward
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
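As a rough illustration of such a request, the sketch below queries the Search Console API for the page level dimensions using the google-api-python-client; the credential handling, property URL, and date range are placeholders, and this is not the RAMP harvester code itself.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder credentials; RAMP's own authentication setup is not shown here.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("webmasters", "v3", credentials=creds)

# Page level request: one row per URL per day (the country/device set would use
# dimensions=["country", "device", "date"] instead).
body = {
    "startDate": "2018-08-19",
    "endDate": "2018-08-19",
    "dimensions": ["page", "date"],
    "rowLimit": 25000,
}
response = service.searchanalytics().query(
    siteUrl="https://repository.example.edu/", body=body
).execute()

for row in response.get("rows", []):
    page, date = row["keys"]
    print(page, date, row["clicks"], row["impressions"], row["ctr"], row["position"])
```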
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to a non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from GSC, each URL is analyzed to determine whether it points to an HTML wrapper page or to an actual content file, and URLs that point to content files are flagged as "citable content." Following this analysis, one additional field, citableContent, is added to the page level data; it records whether each page/URL in the GSC data points to citable content, with possible values "Yes" and "No."
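A simplified sketch of how such a flag can be derived from a URL is shown below; the extension list is an assumption used for illustration and may differ from the one RAMP actually applies.

```python
from urllib.parse import urlparse

# Illustrative set of non-HTML content file extensions (assumption, not RAMP's actual list).
CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xls", ".xlsx", ".zip", ".txt"}

def citable_content(url: str) -> str:
    """Return "Yes" if the URL appears to point to a content file, otherwise "No"."""
    path = urlparse(url).path.lower()
    return "Yes" if any(path.endswith(ext) for ext in CONTENT_EXTENSIONS) else "No"

print(citable_content("https://repository.example.edu/bitstream/1234/thesis.pdf"))  # Yes
print(citable_content("https://repository.example.edu/handle/1234"))                # No
```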
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD are calculated from the page level data by keeping only rows where citableContent is "Yes" and summing the clicks on those rows.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example of the Row Table.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study is based on historical data for some of the indicators on the Egyptian Stock Exchange (EGX), with the aim of building a prediction model with high accuracy. The data used in this study were purchased from Egypt for Information Dissemination (EGID), a governmental organization that provides data for EGX. The data contain six stock market indices. The first, the EGX-30 index, is calculated in local currency and also denominated in US dollars; it measures the top 30 firms in terms of liquidity and activity. The second index, EGX-30-Capped, is designed to track the performance of the most traded companies in accordance with the rules set for mutual funds. The third index, EGX-70, aims at providing wider tools for investors to monitor market performance. The fourth, the EGX-100 index, evaluates the performance of the 100 most active firms, including the 30 constituents of the EGX-30 index as well as the 70 constituents of the EGX-70 index. The NILE index avoids concentration on one industry and therefore gives a good representation of the various industries/sectors in the economy; it is weighted by market capitalization and adjusted by free float. The last index, EGX-50-EWI, tracks the top 50 companies in terms of liquidity and activity; it is designed to balance the impact of price changes among the constituents of the index, which each have a fixed weight of 2% at each quarterly review.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Manufacturing process feature selection and categorization
Abstract: Data from a semi-conductor manufacturing process
A complex modern semi-conductor manufacturing process is normally under consistent surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and it is often the case that the useful information is buried in the latter two. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This will enable increased process throughput, decreased time to learning, and reduced per-unit production costs.
To enhance current business improvement techniques, the application of feature selection as an intelligent systems technique is being investigated.
The dataset presented in this case represents a selection of such features, where each example represents a single production entity with its associated measured features, and the labels represent a simple pass/fail yield for in-house line testing together with an associated date-time stamp: -1 corresponds to a pass, 1 corresponds to a fail, and the date-time stamp is for that specific test point.
Using feature selection techniques, it is desired to rank features according to their impact on the overall yield for the product; causal relationships may also be considered with a view to identifying the key features.
Results may be submitted in terms of feature relevance for predictability, using error rates as the evaluation metric. It is suggested that cross-validation be applied to generate these results. Some baseline results are shown below for basic feature selection techniques using a simple kernel ridge classifier and 10-fold cross-validation.
Baseline Results: Pre-processing objects were applied to the dataset simply to standardize the data and remove the constant features, and then a number of different feature selection objects, each selecting the 40 highest-ranked features, were applied with a simple classifier to achieve some initial results. 10-fold cross-validation was used and the balanced error rate (BER) was generated as the initial performance metric to help investigate this dataset.
SECOM Dataset: 1567 examples, 591 features, 104 fails

FSmethod (40 features)    BER %        True + %      True - %
S2N (signal to noise)     34.5 +-2.6   57.8 +-5.3    73.1 +-2.1
Ttest                     33.7 +-2.1   59.6 +-4.7    73.0 +-1.8
Relief                    40.1 +-2.8   48.3 +-5.9    71.6 +-3.2
Pearson                   34.1 +-2.0   57.4 +-4.3    74.4 +-4.9
Ftest                     33.5 +-2.2   59.1 +-4.8    73.8 +-1.8
Gram Schmidt              35.6 +-2.4   51.2 +-11.8   77.5 +-2.3
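The scikit-learn sketch below reproduces the general shape of this baseline rather than the exact pipeline that produced the numbers above: RidgeClassifier stands in for the kernel ridge classifier, SelectKBest with f_classif approximates the F-test ranking, and mean imputation of missing values is added as one possible way of handling the nulls.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def ber_baseline(X: np.ndarray, y: np.ndarray) -> float:
    """10-fold cross-validated balanced error rate for a simple 40-feature pipeline."""
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),   # one possible treatment of NaN values
        ("constant", VarianceThreshold()),            # remove constant features
        ("scale", StandardScaler()),                  # standardize the data
        ("select", SelectKBest(f_classif, k=40)),     # keep the 40 highest-ranked features
        ("clf", RidgeClassifier()),                   # stand-in for the kernel ridge classifier
    ])
    # Balanced error rate (BER) = 1 - balanced accuracy, averaged over the folds.
    scores = cross_val_score(pipeline, X, y, cv=10,
                             scoring=make_scorer(balanced_accuracy_score))
    return float((1 - scores).mean())
```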
Attribute Information:
Key facts: Data Structure: The data consist of two files. The dataset file SECOM contains 1567 examples, each with 591 features (a 1567 x 591 matrix), and the labels file contains the classification and date-time stamp for each example.
As with any real-life data situation, this data contains null values of varying intensity depending on the individual features. This needs to be taken into consideration when investigating the data, either through pre-processing or within the technique applied.
The data are represented in a raw text file, with each line representing an individual example and the features separated by spaces. The null values are represented by the 'NaN' value, as per MATLAB.
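A minimal loading sketch with pandas follows; the file names secom.data and secom_labels.data and the labels-file layout are assumptions based on common distributions of this dataset rather than details given above.

```python
import pandas as pd

# Feature matrix: one example per line, 591 space-separated features, 'NaN' for nulls.
X = pd.read_csv("secom.data", sep=r"\s+", header=None, na_values="NaN")
print(X.shape)  # expected: (1567, 591)

# Labels file (layout assumed): classification (-1 = pass, 1 = fail) and a date-time stamp.
labels = pd.read_csv("secom_labels.data", sep=" ", header=None,
                     names=["label", "timestamp"])
print(labels["label"].value_counts())  # expected: 104 fails

# Per-feature null counts vary; inspect them before choosing an imputation strategy.
print(X.isna().sum().sort_values(ascending=False).head())
```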
Authors: Michael McCann, Adrian Johnston
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. The data describe student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
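As a hedged illustration of this point about G1 and G2, the sketch below compares predicting G3 with and without the earlier period grades; the file name, separator, and model choice are assumptions for demonstration, not the setup used in the cited paper.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# The Mathematics file is commonly distributed as a semicolon-separated CSV (assumption).
mat = pd.read_csv("student-mat.csv", sep=";")

X = pd.get_dummies(mat.drop(columns=["G3"]))  # one-hot encode the categorical attributes
y = mat["G3"]

model = RandomForestRegressor(random_state=0)

# Predicting G3 is comparatively easy when G1 and G2 are available...
r2_with = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# ...and considerably harder (but more useful) without them.
r2_without = cross_val_score(model, X.drop(columns=["G1", "G2"]), y,
                             cv=5, scoring="r2").mean()

print(f"R^2 with G1/G2: {r2_with:.2f}, without: {r2_without:.2f}")
```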
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of all 81 selected documents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The main file is performance_correction.html in AAN3_analysis_scripts.zip. It contains the results of the main analyses.
See AAN3_readme_figshare.txt: 1. Title of Dataset: Open data: Is auditory awareness negativity confounded by performance?
Author Information
A. Principal Investigator Contact Information
Name: Stefan Wiens
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.su.se/profiles/swiens-1.184142
Email: sws@psychology.su.se
B. Associate or Co-investigator Contact Information
Name: Rasmus Eklund
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.su.se/profiles/raek2031-1.223133
Email: rasmus.eklund@psychology.su.se
C. Associate or Co-investigator Contact Information
Name: Billy Gerdfeldter
Institution: Department of Psychology, Stockholm University, Sweden
Internet: https://www.su.se/profiles/bige1544-1.403208
Email: billy.gerdfeldter@psychology.su.se
Date of data collection: Subjects (N = 28) were tested between 2018-12-03 and 2019-01-18.
Geographic location of data collection: Department of Psychology, Stockholm, Sweden
Information about funding sources that supported the collection of the data: Swedish Research Council / Vetenskapsrådet (Grant 2015-01181) Marianne and Marcus Wallenberg (Grant 2019-0102)
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: CC BY 4.0
Links to publications that cite or use the data: Eklund R., Gerdfeldter B., & Wiens S. (2020). Is auditory awareness negativity confounded by performance? Consciousness and Cognition. https://doi.org/10.1016/j.concog.2020.102954
The study was preregistered: https://doi.org/10.17605/OSF.IO/W4U7V
Links to other publicly accessible locations of the data: N/A
Links/relationships to ancillary data sets: N/A
Was data derived from another source? No
Recommended citation for this dataset: Eklund R., Gerdfeldter B., & Wiens S. (2020). Open data: Is auditory awareness negativity confounded by performance? Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.9724280
DATA & FILE OVERVIEW
File List: The files contain the raw data, scripts, and results of main and supplementary analyses of the electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.
AAN3_experiment_scripts.zip: contains the Python files to run the experiment
AAN3_rawdata_EEG.zip: contains raw EEG data files for each subject in .bdf format (generated by Biosemi)
AAN3_rawdata_log.zip: contains log files of the EEG session (generated by Python)
AAN3_EEG_scripts.zip: Python-MNE scripts to process and to analyze the EEG data
AAN3_EEG_source_localization_scripts.zip: Python-MNE files needed for source localization. The template MRI is provided in this zip. The files are obtained from the MNE tutorial (https://mne.tools/stable/auto_tutorials/source-modeling/plot_eeg_no_mri.html?highlight=template). Note that the stc folder is empty. The source time course files are not provided because of their large size. They can quickly be generated from the analysis script. They are needed for the source localization.
AAN3_analysis_scripts.zip: R scripts to analyze the data. The main file is performance_correction.html. It contains the results of the main analyses.
AAN3_results.zip: contains summary data files, figures, and tables that are created by Python-MNE and R.
METHODOLOGICAL INFORMATION
Description of methods used for collection/generation of data: The auditory stimuli were two 100-ms tones (f = 900 Hz and 1400 Hz, 5 ms fade-in and fade-out). The experiment was programmed in Python: https://www.python.org/ and used extra functions from here: https://github.com/stamnosslin/mn The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see linked publication.
Methods for processing the data: We computed event-related potentials and source localization. See the linked publication.
Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html# Rstudio used with R (R Core Team, 2016): https://rstudio.com/products/rstudio/ Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3
Standards and calibration information, if appropriate: For information, see linked publication.
Environmental/experimental conditions: For information, see linked publication.
Describe any quality-assurance procedures performed on the data: For information, see linked publication.
People involved with sample collection, processing, analysis and/or submission:
DATA-SPECIFIC INFORMATION: All relevant information can be found in the MNE-Python and R scripts (in EEG_scripts and analysis_scripts folders) that process the raw data. For example, we added notes to explain what different variables mean.
The folder structure needs to be as follows:
AAN3 (main folder)
    data
        bdf (AAN3_rawdata_EEG)
        log (AAN3_rawdata_log)
        raw (empty)
    MNE (AAN3_EEG_scripts)
    R (AAN3_analysis_scripts)
    results (AAN3_results)
    source (AAN3_EEG_source_localization_files)
To run the MNE-Python scripts: Anaconda was used with MNE-Python 0.20 (see installation at https://mne.tools/stable/index.html#). For Downsample_AAN3, ICA_raw_AAN3, Preprocess_AAN3, Make_inverse_operator_AAN3.py, BehaviorTables_AAN3, and PlotSource, the complete scripts should be run (from the Anaconda prompt). For Analysis_AAN3, one section at a time should be run (from Spyder).
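For orientation only, the sketch below shows the general shape of such a pipeline in MNE-Python (reading one .bdf recording, filtering, epoching, averaging); the file name, event codes, and filter/epoch settings are illustrative and do not reproduce the study's actual parameters, which are defined in the provided scripts and the linked publication.

```python
import mne

# File name, event IDs, and filter/epoch settings below are placeholders.
raw = mne.io.read_raw_bdf("subject01.bdf", preload=True)
raw.filter(l_freq=0.1, h_freq=30.0)

# Extract events from the recording's trigger channel and build epochs around them.
events = mne.find_events(raw)
epochs = mne.Epochs(raw, events, event_id={"tone": 1},
                    tmin=-0.2, tmax=0.5, baseline=(None, 0))

# Event-related potential: the average across epochs for one condition.
evoked = epochs["tone"].average()
evoked.plot()
```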
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparative experiments of multimodal sentiment analysis models on the dataset CMU-MOSEI.