Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with 72000 pins from 117 users in Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper is published in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
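The per-user counts described above can be checked once the user-index files are loaded. The following is a minimal sketch, not part of the dataset: the on-disk file format is not stated here, so the function takes an already-loaded list of user indices, and the name check_split and the toy labels are hypothetical.

```python
from collections import Counter


def check_split(user_indices, n_users=117, pins_per_user=400):
    """Return {user: actual_count} for users whose pin count is unexpected."""
    counts = Counter(user_indices)
    return {
        user: counts.get(user, 0)
        for user in range(n_users)
        if counts.get(user, 0) != pins_per_user
    }


# Toy stand-in for a loaded user-index file: 3 users with 2 pins each.
toy_labels = [0, 0, 1, 1, 2, 2]
mismatches = check_split(toy_labels, n_users=3, pins_per_user=2)
print(mismatches)  # an empty dict means every user has the expected count
```

For the real data, the same call with pins_per_user=400 would apply to the train split and pins_per_user=100 to the validation and test splits, assuming the labels have been read into a list of integers.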
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY
TUSHAR MAHULE*, KIRK BORNE**, SANDIPAN DEY*, SUGANDHA ARORA*, AND HILLOL KARGUPTA***
Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI, a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a web-based system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. The case study evaluates the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Real-world call detail records (CDRs) gathered from Telemarketer PBX and mobile phone users.
Peer-to-peer (P2P) networks are gaining popularity in many applications such as file sharing, e-commerce, and social networking, many of which deal with rich, distributed data sources that can benefit from data mining. P2P networks are, in fact, well-suited to distributed data mining (DDM), which deals with the problem of data analysis in environments with distributed data, computing nodes, and users. This article offers an overview of DDM applications and algorithms for P2P environments, focusing particularly on local algorithms that perform data analysis by using computing primitives with limited communication overhead. The authors describe both exact and approximate local P2P data mining algorithms that work in a decentralized and communication-efficient manner.
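To illustrate what a decentralized, communication-limited primitive can look like, here is a minimal sketch of neighbor averaging on a ring of peers. It is an illustrative stand-in, not one of the algorithms described in the article: the ring topology, the synchronous rounds, and the function name neighbor_average are all assumptions. Each peer only ever exchanges values with its two neighbors, yet every local estimate approaches the global mean.

```python
def neighbor_average(values, rounds=50):
    """Synchronous averaging on a ring: each peer mixes with its two neighbors."""
    n = len(values)
    current = list(values)
    for _ in range(rounds):
        # Every peer replaces its value with the mean of itself and its
        # left and right neighbors; the global sum is preserved each round.
        current = [
            (current[(i - 1) % n] + current[i] + current[(i + 1) % n]) / 3.0
            for i in range(n)
        ]
    return current


local_values = [10.0, 20.0, 30.0, 40.0, 50.0]
estimates = neighbor_average(local_values)
print(estimates)  # each entry should be close to the global mean of the inputs
```

The design choice worth noting is that no peer ever sees the full data: per round, communication is limited to a constant number of messages per peer regardless of network size.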
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to the inherent characteristics of cumulative sequences of unbalanced data, mining results for this kind of data are often affected by the large number of categories, which degrades mining performance. To address this problem and improve mining performance, an algorithm for mining cumulative sequences of unbalanced data based on probability matrix decomposition is studied. The natural nearest neighbors of the minority samples in the unbalanced cumulative sequence are determined, and the minority samples are clustered according to the natural nearest neighbor relationship. Within each cluster, new samples are generated from the core points of dense regions and the non-core points of sparse regions, and these new samples are added to the original cumulative sequence to balance it. The probability matrix decomposition method is then used to generate two Gaussian-distributed random matrices from the balanced cumulative sequence, and a linear combination of low-dimensional eigenvectors is used to explain the preference of specific users for the data sequence. At the same time, from a global perspective, the AdaBoost idea is used to adaptively adjust the sample weights and optimize the probability matrix decomposition algorithm. Experimental results show that the algorithm can effectively generate new samples, reduce the imbalance of the cumulative sequence, and obtain more accurate mining results, optimizing global errors as well as single-sample errors more efficiently. The minimum RMSE is obtained when the decomposition dimension is 5. The proposed algorithm has good classification performance on the balanced cumulative sequence, and its average ranking on the F value, G-mean and AUC indices is the best.
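For readers unfamiliar with the imbalance-oriented metrics named above, the sketch below computes an F value and a G-mean from binary confusion-matrix counts. It assumes the common definitions (F value as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity); the abstract does not state which variants the authors used, and the function names and counts are hypothetical.

```python
import math


def f_value(tp, fp, fn):
    """Harmonic mean of precision and recall (the F1 variant is assumed)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts for a minority class of 10 and a majority class of 100.
f_score = f_value(tp=8, fp=10, fn=2)
g_score = g_mean(tp=8, fn=2, tn=90, fp=10)
print(f_score, g_score)
```

Both metrics stay low when a classifier ignores the minority class, which is why they are preferred over plain accuracy on unbalanced data.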
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geographic information provides an important insight into many data mining and social media systems. However, users are reluctant to provide such information due to various concerns, such as inconvenience and privacy. In this paper, we aim to develop a deep learning based solution to predict geographic information for tweets. Current approaches have two major limitations: (a) they struggle to model long-term information, and (b) it is hard to explain to end users what the model learns. To address these issues, our proposed model embraces three key ideas. First, we introduce a multi-head self-attention model for text representation. Second, to further improve the result on informal language, we treat subwords as a feature in our model. Lastly, the model is trained jointly on the city and country labels to incorporate the information coming from both. Experiments on the W-NUT 2016 Geo-tagging shared task show that our proposed model is competitive with state-of-the-art systems in terms of accuracy, while also achieving a better distance measure than existing approaches.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each study reviewed is here catalogued as follows.
· Level of difficulty: Classification Task, Number and List of Classes.
· Approach: Method and Main Features.
· Performance: Score, Metric, Validation Method.
· Realism of dataset: Ground Truth, Person-day, Respondents, Observations, Collection Time, Area, Smartphone App.
· Sensors involved: AGPS, Inertial Navigation Systems (INS), Geographic Information Systems (GIS), Data Fusion.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the review results for the manuscript "A Systematic Review on Privacy-Preserving Distributed Data Mining" authored by Chang Sun, Lianne Ippel, Andre Dekker, Michel Dumontier, Johan van Soest. The dataset covers 231 published articles about privacy-preserving distributed data mining. Variables include article DOI, title, authors, keywords, user scenarios, distributed data scenarios, privacy/security definition/proof/analysis, privacy statement, privacy-preserving methods category, privacy-preserving methods (specific), data mining problem, data mining/machine learning methods, experiment data information, accuracy of the methods, efficiency (computation and communication cost), and scalability. The search method and evaluation criteria are described in the paper "A Systematic Review on Privacy-Preserving Distributed Data Mining". The DOI and link to the paper will be provided when the paper gets published.
https://www.verifiedmarketresearch.com/privacy-policy/
Business Analytics Market was valued at USD 84.42 Billion in 2024 and is projected to reach USD 176.14 Billion by 2031, growing at a CAGR of 9.63% from 2024 to 2031.
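The compound annual growth rate (CAGR) quoted above follows from the start value, the end value, and the number of compounding periods. The sketch below uses the standard formula; the function name cagr is hypothetical, and the number of periods is an assumption, since the blurb does not say how it counts years. With eight periods (counting both 2024 and 2031) the quoted figures give roughly 9.63%, whereas seven periods would give roughly 11.1%.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate: (end / start) ** (1 / periods) - 1."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Figures from the blurb; the period count of 8 is an assumption.
rate = cagr(84.42, 176.14, periods=8)
print(f"{rate * 100:.2f}%")
```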
Global Business Analytics Market Drivers
The market drivers for the Business Analytics Market can be influenced by various factors. These may include:
Growing Adoption of Big Data Analytics: In response to the exponential growth of data, organizations are increasingly using big data analytics to extract meaningful insights. Business analytics facilitates informed decision-making through data analysis.
Growing Need for Data-driven Decision Making: Businesses are realizing the significance of data-driven decision making for gaining a competitive edge. Business analytics offers the methods and tools for analyzing data and extracting meaningful insights to improve decision-making.
Growing Need for Predictive and Prescriptive Analytics: Predictive and prescriptive analytics are increasingly in demand as a means of projecting future trends and results. Businesses can use business analytics to forecast future outcomes based on historical data and to prescribe actions that achieve desired outcomes.
Growing Emphasis on Customer Analytics: As e-commerce and digital marketing gain traction, companies are placing more emphasis on understanding the behavior and preferences of their customers. Business analytics is used to analyze customer data in order to increase consumer engagement and personalize marketing efforts.
Emergence of Advanced Technologies: Developments in fields such as artificial intelligence (AI), machine learning (ML), and natural language processing (NLP) are propelling the use of advanced analytics solutions. Thanks to these technologies, businesses can now analyze data more effectively and gain deeper insights.
Need for Operational Efficiency and Cost Optimization: Companies are always under pressure to increase operational efficiency and reduce costs. Business analytics promotes market expansion by helping to identify opportunities for process improvements and cost reductions.
Compliance and Regulatory Requirements: Growing regulatory requirements in a number of industries, including healthcare, banking, and retail, are fueling the use of business analytics solutions for risk management and compliance reporting.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Anomaly detection is the process of identifying items, events or observations that do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets. A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:
1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3 or 4 dimensional arrays packaged in NetCDF and HDF.
2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid Cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private Cloud computing cluster to a public Cloud provider like Amazon Cloud services).
3. An extension to industry-standard search solutions (OpenSearch and Faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.
We will use a hybrid Cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend.
The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database. OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event. The OceanXtremes web portal will allow users to define their own anomaly or feature types, for which continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data mining algorithm (e.g., differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform, scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
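The "differences from climatology above a specified threshold" idea mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the OceanXtremes implementation: observations are assumed to be a dictionary keyed by (year, day_of_year), the climatology is taken as the plain per-day mean across years, and the function names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def climatology(observations):
    """Mean value for each day-of-year, averaged across all years."""
    by_day = defaultdict(list)
    for (_year, day_of_year), value in observations.items():
        by_day[day_of_year].append(value)
    return {day: mean(values) for day, values in by_day.items()}


def find_anomalies(observations, threshold):
    """Keys whose value departs from the climatology by more than threshold."""
    clim = climatology(observations)
    return sorted(
        key
        for key, value in observations.items()
        if abs(value - clim[key[1]]) > threshold
    )


# Hypothetical sea-surface temperatures for two days over three years.
toy = {
    (2001, 1): 20.0, (2002, 1): 20.5, (2003, 1): 24.5,
    (2001, 2): 19.0, (2002, 2): 19.2, (2003, 2): 18.8,
}
anomalies = find_anomalies(toy, threshold=2.0)
print(anomalies)
```

In a real archive the same computation would run per grid cell over large NetCDF or HDF arrays, which is where the parallel engine described above comes in.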
https://www.zionmarketresearch.com/privacy-policy
The global Predictive Analytics Market was worth USD 16.19 Billion in 2023 and is projected to reach USD 113.8 Billion by 2032, with a CAGR of around 24.19% between 2024 and 2032.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A dataset consisting of 751,500 English app reviews of 12 online shopping apps. The dataset was scraped from the internet using a Python script. This ShoppingAppReviews dataset contains app reviews of the 12 most popular online shopping Android apps: Alibaba, Aliexpress, Amazon, Daraz, eBay, Flipkart, Lazada, Meesho, Myntra, Shein, Snapdeal and Walmart. Each review entry contains metadata such as review score, thumbsupcount, review posting time, reply content, etc. The dataset is organized in a zip file containing 12 JSON files, one for each online shopping app. This dataset can be used to obtain valuable information about customers' feedback regarding their user experience of these financially important apps.
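A zip archive of per-app JSON files like the one described above can be read without extracting it to disk. The sketch below is a hedged illustration: it assumes each member is a single JSON document, and the member name Amazon.json, the field names, and the function load_reviews are hypothetical, since the actual schema is not given here.

```python
import io
import json
import zipfile


def load_reviews(zip_source):
    """Map each JSON member name in the archive to its parsed content."""
    reviews = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".json"):
                with archive.open(name) as handle:
                    reviews[name] = json.load(handle)
    return reviews


# Build a tiny in-memory archive as a stand-in for the real zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "Amazon.json",
        json.dumps([{"score": 5, "content": "Great app"}]),
    )
buffer.seek(0)

sample = load_reviews(buffer)
print({name: len(entries) for name, entries in sample.items()})
```

Passing a file path instead of the in-memory buffer works the same way, because zipfile.ZipFile accepts either.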
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, "Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets from 2017–2022 and 100 Research Questions", Journal of Analytics, Volume 1, Issue 2, 2022, pp. 72-97, DOI: https://doi.org/10.3390/analytics1020007
Abstract
The exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and diverse use cases in assisted living, military, healthcare, firefighting, and industry 4.0. The exoskeleton market is projected to increase by multiple times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction, towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today’s living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset by the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, where the topics found in the conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 Tweets about exoskeletons that were posted in a 5-year period from 21 May 2017 to 21 May 2022.
Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
https://exactitudeconsultancy.com/privacy-policy
The market is projected to be valued at $X million in 2024, driven by factors such as increasing consumer awareness and the rising prevalence of industry-specific trends. The market is expected to grow at a CAGR of Y%, reaching approximately $Z million by 2034.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the scripts and dataset used in the study reported in the paper "Mining the Technical Roles of GitHub Users". The files are described in more detail below:
processed_ground_truth.csv: A CSV file with the information of the developers considered in the study. Due to privacy concerns, we have already preprocessed the dataset to remove identifying information. Please contact the authors in case you need the original one.
processed_ground_truth_fullstack.csv: Same CSV file but with fullstack developers.
script.ipynb, utils.py: Source code of the script used in our study.
Dockerfile, docker-compose.yml, requirements.txt: Files to replicate the code environment used in this study.
BoW-tuning.csv: List of classification results for different bag-of-words parameters.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Travel regions are not necessarily defined by political or administrative boundaries. For example, in the Schengen region of Europe, tourists can travel freely across national borders. Identifying transboundary travel regions is an interesting problem which we aim to solve using mobility analysis of Twitter users. Our proposed solution comprises collecting geotagged tweets, combining them into trajectories and, thus, mining thousands of trips undertaken by Twitter users. After aggregating these trips into a mobility graph, we apply a community detection algorithm to find coherent regions throughout the world. The discovered regions provide insights into international travel and can reveal both domestic and transnational travel regions.
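The aggregation step described above, turning trajectories into a mobility graph, can be sketched as follows. This is a minimal illustration under assumptions: trajectories are taken to be ordered lists of place names, edges are treated as undirected and weighted by trip count, and the function name and sample trips are hypothetical. The community detection step itself is not shown.

```python
from collections import Counter


def build_mobility_graph(trajectories):
    """Count trips between consecutive distinct places as undirected edges."""
    edges = Counter()
    for trajectory in trajectories:
        for origin, destination in zip(trajectory, trajectory[1:]):
            if origin != destination:  # ignore consecutive tweets from one place
                edges[tuple(sorted((origin, destination)))] += 1
    return edges


# Hypothetical trajectories reconstructed from geotagged tweets.
trajectories = [
    ["Berlin", "Prague", "Vienna"],
    ["Prague", "Berlin"],
    ["Lisbon", "Lisbon", "Madrid"],
]
graph = build_mobility_graph(trajectories)
for (place_a, place_b), weight in sorted(graph.items()):
    print(place_a, place_b, weight)
```

The resulting weighted edge list is the kind of input a community detection algorithm would then partition into travel regions.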
https://www.zionmarketresearch.com/privacy-policy
The global Data Warehouse as a Service (DWaaS) Market was valued at USD 5.03 Billion in 2023 and is predicted to reach USD 30.37 Billion by 2032, with a CAGR of 22.1%.
https://www.verifiedmarketresearch.com/privacy-policy/
Automotive Artificial Intelligence Market size was valued at USD 2.3 Billion in 2024 and is projected to reach USD 12.94 Billion by 2031, growing at a CAGR of 24.1% from 2024 to 2031.
Global Automotive Artificial Intelligence Market Drivers
Growing Need for Autonomous Vehicles (AVs): The growing need for autonomous vehicles is one of the main factors propelling the automotive artificial intelligence industry. Artificial Intelligence plays a major role in AVs’ ability to perceive, plan, and control. The need for automotive AI is anticipated to increase in tandem with the maturation of AV technology and the growing acceptance of AVs by consumers.
Increasing Adoption of Advanced Driver-Assistance Systems (ADAS): ADAS refers to a group of technologies that automate or support driving operations via the use of sensors and software. These systems include features such as adaptive cruise control, lane departure warning, and automated emergency braking. The increasing use of ADAS is driving the need for automotive AI, because these systems need AI capabilities to work well.
Strict Government Regulations for Safe Driving: Governments all over the world are implementing strict regulations for safe driving. These rules are making automotive AI more and more necessary, and they are also driving the development of ADAS and other safety technologies in cars.
Focus on Convenience Features and Improved User Experience: Cars with amenities that make driving more pleasurable and convenient are becoming more and more in demand from consumers. Voice recognition, in-car personalization, and gesture control are just a few of the AI-powered features that are gaining popularity. The market for automotive AI is anticipated to continue growing as a result of this trend.
Major OEM Investments: Major automakers are making significant investments in the advancement of artificial intelligence (AI) technologies for their automobiles. The creation of fresh, cutting-edge AI-powered automotive features is accelerating thanks to these investments.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mining opinions from reviews has been a field of ever-growing research. This includes mining opinions at the document level, sentence level, and even aspect level of a review. While explicitly mentioned aspects in a review have been widely researched, very little work has been done on gathering opinions on aspects that are implied and not explicitly mentioned. For example, the sentence “the flight was spacious and there was plenty of legroom” gives an opinion on the cabin and seat entities of an airline. Words like “spacious” and phrases like “plenty of legroom” help identify these implied entities and the opinions attached to them. Not much research has been done on gathering such implicit aspects and opinions for airline reviews. The present dataset is a manually annotated, domain-specific, aspect-based corpus that supports studies extracting and analyzing opinions about such implied aspects and entities of airlines.
https://www.verifiedmarketresearch.com/privacy-policy/
Fleet Management Tool For Mining Market size was valued at USD 3.5 Billion in 2023 and is projected to reach USD 6.8 Billion by 2031, growing at a CAGR of 9.5% during the forecasted period 2024 to 2031.
Global Fleet Management Tool For Mining Market Drivers
The market drivers for the Fleet Management Tool For Mining Market can be influenced by various factors. These may include:
• Increased Demand for Operational Efficiency: Mining companies are seeking to improve efficiency and productivity in their operations. Fleet management tools help optimize fleet performance, reduce downtime, and ensure timely maintenance, leading to cost savings and improved operational efficiency.
• Technological Advancements: The development of advanced technologies such as IoT, GPS, and real-time data analytics has significantly enhanced fleet management capabilities. These technologies enable better tracking, monitoring, and management of mining fleets, driving the adoption of fleet management tools.
Global Fleet Management Tool For Mining Market Restraints
Several factors can act as restraints or challenges for the Fleet Management Tool For Mining Market. These may include:
• High Initial Investment: The cost of implementing advanced fleet management tools can be significant, including expenses for software, hardware, and integration with existing systems. This high upfront investment may deter smaller mining companies from adopting these technologies.
• Complexity of Integration: Integrating fleet management tools with existing mining operations and equipment can be complex and time-consuming. This complexity may lead to resistance from companies accustomed to their current systems.