In 2020, according to respondents surveyed, data masters typically leveraged a variety of external data sources to enhance their insights. The most popular external data sources for data masters were publicly available competitor data, open data, and proprietary datasets from data aggregators, at **, **, and ** percent, respectively.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To enhance efficiency in drug development, interest in augmenting randomized controlled trials by supplementing the control arm with external data has grown rapidly. However, external data may lack between-population exchangeability. To facilitate proper information borrowing, we propose two two-stage strategies: the stratified propensity score self-adaptive mixture (SPS-SAM) prior and the stratified propensity score calibrated elastic mixture (SPS-CEM) prior. The mixture prior is composed of an informative meta-analytic predictive (MAP) prior and a vague prior. In the first stage, propensity score (PS) stratification is performed to select similar subjects from the external data. Within each stratum, to mitigate measured confounding, we calculate the PS overlap coefficient and account for between-group heterogeneity by adjusting the hyperparameters of the MAP prior. In the second stage, to reduce unmeasured confounding and address potential prior-data conflict, we construct a data-driven mixture prior incorporating an adaptive weight that dynamically controls the proportion of the MAP prior. To obtain the adaptive weight measuring the congruence between the current and the external data, the SPS-SAM prior uses the likelihood ratio test and the SPS-CEM prior uses the scaled t-test. Simulation studies and illustrative examples demonstrate that both proposed methods outperform existing approaches, yielding smaller bias and greater calibrated power and achieving accurate, efficient, and robust estimation of the treatment effect.
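The paper's exact formulation is not reproduced in this abstract; as a rough illustration of the first-stage ingredient, the following minimal Python sketch estimates a per-stratum PS overlap coefficient and uses it to discount the external arm's contribution. The histogram estimator and the effective-sample-size discounting rule are illustrative assumptions, not the authors' hyperparameter adjustment.

```python
import numpy as np

def overlap_coefficient(ps_current, ps_external, bins=20):
    """Overlap coefficient of two propensity-score samples, estimated
    as the integral of min(p, q) over a shared histogram grid on [0, 1]."""
    grid = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(ps_current, bins=grid, density=True)
    q, _ = np.histogram(ps_external, bins=grid, density=True)
    width = grid[1] - grid[0]
    return float(np.sum(np.minimum(p, q)) * width)

# Hypothetical PS values for one stratum (placeholders, not study data).
rng = np.random.default_rng(0)
ps_cur = rng.beta(2.0, 5.0, size=200)
ps_ext = rng.beta(2.5, 5.0, size=400)

# Illustrative discounting: shrink the external data's effective sample
# size by the overlap coefficient (an assumption, not the paper's exact
# MAP hyperparameter adjustment).
omega = overlap_coefficient(ps_cur, ps_ext)
ess_external = omega * len(ps_ext)
print(f"overlap = {omega:.3f}, discounted ESS = {ess_external:.1f}")
```

Here a smaller overlap directly shrinks how much the external stratum can contribute, which mirrors the intent, though not the exact form, of the first-stage adjustment.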
The purpose of building a DGA classifier isn't specifically to take down botnets, but to discover and detect their use on our network or services. If you have a list of domains resolved and accessed at your organization, it is now possible to see which of those were potentially generated and used by malware. (A small feature-extraction sketch follows the source list below.)
The dataset consists of three sources (as described in the Data-Driven Security blog):
Alexa: For samples of legitimate domains, an obvious choice is to go to the Alexa list of top web sites. But it's not ready for our use as is. If you grab the top 1 million Alexa domains and parse the list, you'll find just over 11 thousand entries are full URLs and not just domains, and there are thousands of domains with subdomains that don't help us (we are only classifying on domains here). So after I removed the URLs, de-duplicated the domains, and cleaned it up, I ended up with the Alexa top 965,843.
"Real World" Data from OpenDNS: After reading the post from Frank Denis at OpenDNS titled "Why Using Real World Data Matters For Building Effective Security Models", I grabbed their 10,000 Top Domains and their 10,000 Random Samples. If we compare those to the top Alexa domains, 6,901 of the top ten thousand are in the Alexa data and 893 of the random domains are in the Alexa data. I will clean that up as I make the final training dataset.
DGA domains: The Click Security version wasn't very clear about where they got their bad domains, so I decided to collect my own, and this was rather fun. Because I work with some interesting characters (who know interesting characters), I was able to collect several data sets from recent botnets: "Cryptolocker", two separate "Game Over Zeus" algorithms, and an anonymous collection of malicious (and algorithmically generated) domains. In the end, I was able to collect 73,598 algorithmically generated domains.
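To give a flavor of the lexical features a classifier like this might use (the blog's actual feature set is not reproduced here), a minimal Python sketch computing two common ones, domain length and character entropy:

```python
import math
from collections import Counter

def domain_features(domain: str) -> dict:
    """Two simple lexical features often used to separate DGA domains
    from legitimate ones (illustrative; not the blog's feature set)."""
    name = domain.lower().split(".")[0]  # classify on the name, drop the TLD
    counts = Counter(name)
    n = len(name)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return {"length": n, "entropy": round(entropy, 3)}

print(domain_features("google.com"))          # short, low entropy
print(domain_features("x3k9qzv7w1mfjd.net"))  # longer, higher entropy
```

Legitimate names tend to be short and low-entropy, while algorithmically generated ones skew longer and more uniform, so even these two features separate a surprising share of the data.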
Our People data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.
Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your People data, gain a deeper understanding of your customers, and power superior client experiences.
1. Geography - City, State, ZIP, County, CBSA, Census Tract, etc.
2. Demographics - Gender, Age Group, Marital Status, Language, etc.
3. Financial - Income Range, Credit Rating Range, Credit Type, Net Worth Range, etc.
4. Persona - Consumer Type, Communication Preferences, Family Type, etc.
5. Interests - Content, Brands, Shopping, Hobbies, Lifestyle, etc.
6. Household - Number of Children, Number of Adults, IP Address, etc.
7. Behaviours - Brand Affinity, App Usage, Web Browsing, etc.
8. Firmographics - Industry, Company, Occupation, Revenue, etc.
9. Retail Purchase - Store, Category, Brand, SKU, Quantity, Price, etc.
10. Auto - Car Make, Model, Type, Year, etc.
11. Housing - Home Type, Home Value, Renter/Owner, Year Built, etc.
People Data Schema & Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU & monthly location pings.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).
People Data Use Cases:
360-Degree Customer View: Get a comprehensive image of customers by means of internal and external data aggregation.
Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting.
Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity.
Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.
Here's the schema of People Data:
person_id
first_name
last_name
age
gender
linkedin_url
twitter_url
facebook_url
city
state
address
zip
zip4
country
delivery_point_bar_code
carrier_route
walk_sequence_code
fips_state_code
fips_county_code
country_name
latitude
longitude
address_type
metropolitan_statistical_area
core_based_statistical_area
census_tract
census_block_group
census_block
primary_address
pre_address
street
post_address
address_suffix
address_secondline
address_abrev
census_median_home_value
home_market_value
property_build_year
property_with_ac
property_with_pool
property_with_water
property_with_sewer
general_home_value
property_fuel_type
year
month
household_id
census_median_household_income
household_size
marital_status
length_of_residence
number_of_kids
pre_school_kids
single_parents
working_women_in_house_hold
homeowner
children
adults
generations
net_worth
education_level
occupation
education_history
credit_lines
credit_card_user
newly_issued_credit_card_user
credit_range_new
credit_cards
loan_to_value
mortgage_loan2_amount
mortgage_loan_type
mortgage_loan2_type
mortgage_lender_code
mortgage_loan2_lender_code
mortgage_lender
mortgage_loan2_lender
mortgage_loan2_ratetype
mortgage_rate
mortgage_loan2_rate
donor
investor
interest
buyer
hobby
personal_email
work_email
devices
phone
employee_title
employee_department
employee_job_function
skills
recent_job_change
company_id
company_name
company_description
technologies_used
office_address
office_city
office_country
office_state
office_zip5
office_zip4
office_carrier_route
office_latitude
office_longitude
office_cbsa_code
office_census_block_group
office_census_tract
office_county_code
company_phone
company_credit_score
company_csa_code
company_dpbc
company_franchiseflag
company_facebookurl
company_linkedinurl
company_twitterurl
company_website
company_fortune_rank
company_government_type
company_headquarters_branch
company_home_business
company_industry
company_num_pcs_used
company_num_employees
company_firm_individual
company_msa
company_msa_name
company_naics_code
company_naics_description
company_naics_code2
company_naics_description2
company_sic_code2
company_sic_code2_description
company_sic_code4
company_sic_code4_description
company_sic_code6
company_sic_code6_description
company_sic_code8
company_sic_code8_description
company_parent_company
company_parent_company_location
company_public_private
company_subsidiary_company
company_residential_business_code
company_revenue_at_side_code
company_revenue_range
company_revenue
company_sales_volume
company_small_business
company_stock_ticker
company_year_founded
company_minorityowned
company_female_owned_or_operated
company_franchise_code
company_dma
company_dma_name
company_hq_address
company_hq_city
company_hq_duns
company_hq_state
company_hq_zip5
company_hq_zip4
company_sect...
https://www.technavio.com/content/privacy-notice
Business Information Market Size 2025-2029
The business information market size is forecast to increase by USD 79.6 billion, at a CAGR of 7.3% between 2024 and 2029.
The market is characterized by the increasing demand for customer-centric solutions as enterprises adapt to evolving customer preferences. This shift necessitates the provision of real-time, accurate, and actionable insights to facilitate informed decision-making. However, this market landscape is not without challenges. The threat of data misappropriation and theft looms large, necessitating robust security measures to safeguard sensitive business information. As businesses continue to digitize their operations and rely on external data sources, ensuring data security becomes a critical success factor. Companies must invest in advanced security technologies and implement stringent data protection policies to mitigate these risks. Navigating this complex market requires a strategic approach that balances the need for customer-centric solutions with the imperative to secure valuable business data.
What will be the Size of the Business Information Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
In today's data-driven business landscape, the continuous and evolving nature of market dynamics plays a pivotal role in shaping various sectors. Data integration solutions enable seamless data flow between different systems, enhancing cloud-based business applications' functionality. Data quality management ensures data accuracy and consistency, crucial for strategic planning and customer segmentation. Data infrastructure, data warehousing, and data pipelines form the backbone of business intelligence, facilitating data storytelling and digital transformation. Data lineage and data mining reveal valuable insights, fueling data analytics platforms and business intelligence infrastructure. Data privacy regulations necessitate robust data management tools, ensuring compliance and protecting sensitive information.
Sales forecasting and business intelligence consulting offer valuable industry analysis and data-driven decision making. Data governance frameworks and data cataloging maintain order and ethics in the vast expanse of big data analytics. Machine learning algorithms, predictive analytics, and real-time analytics drive business intelligence reporting and process modeling, leading to business process optimization and financial reporting software. Sentiment analysis and marketing automation cater to customer needs, while lead generation and data ethics ensure ethical business practices. The ongoing unfolding of market activities and evolving patterns necessitate the integration of various tools and frameworks, creating a dynamic interplay that fuels business growth and innovation.
How is this Business Information Industry segmented?
The business information industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
End-user
BFSI
Healthcare and life sciences
Manufacturing
Retail
Others
Application
B2B
B2C
Geography
North America
US
Canada
Europe
France
Germany
Italy
UK
APAC
China
India
Japan
South Korea
Rest of World (ROW)
By End-user Insights
The BFSI segment is estimated to witness significant growth during the forecast period.
In the dynamic business landscape, data-driven insights have become essential for strategic planning and decision-making across various industries. The market caters to this demand by offering solutions that integrate and manage data from multiple sources. These include cloud-based business applications, data quality management tools, data warehousing, data pipelines, and data analytics platforms. Data storytelling and digital transformation are key trends driving the market's growth, enabling businesses to derive meaningful insights from their data. Data governance frameworks and policies are crucial components of the business intelligence infrastructure. Data privacy regulations, such as GDPR and HIPAA, are shaping the market's development.
Data mining, predictive analytics, and machine learning algorithms are increasingly being used for sales forecasting, customer segmentation, and churn prediction. Business intelligence consulting and industry analysis provide valuable insights for organizations seeking competitive advantage. Data visualization dashboards, market research databases, and data discovery tools facilitate data-driven decision making. Sentiment analysis and predictive analytics are essential for marketing automation and business process
The accumulated dataset combines botnet samples from the Android Genome Malware project, malware security blogs, VirusTotal, and samples provided by a well-known anti-malware vendor. Overall, the dataset includes 1929 samples spanning the period from 2010 (the first appearance of Android botnets) to 2014.
The Android Botnet dataset consists of 14 families:
Family, Year of discovery, No. of samples
AnserverBot, 2011, 244
Bmaster, 2012, 6
DroidDream, 2011, 363
Geinimi, 2010, 264
MisoSMS, 2013, 100
NickySpy, 2011, 199
NotCompatible, 2014, 76
PJapps, 2011, 244
Pletor, 2014, 85
RootSmart, 2012, 28
Sandroid, 2014, 44
TigerBot, 2012, 96
Wroba, 2014, 100
Zitmo, 2010, 80
For more information contact cic@unb.ca.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This RAG: Financial & Legal Retrieval-Augmented Generation Benchmark Evaluation Dataset provides a unique opportunity for professionals in the legal and financial industries to analyze the latest retrieval-augmented generation (RAG) technology. With 200 diverse samples, each containing a relevant context passage and a related question, it is an invaluable assessment tool for measuring different capabilities of retrieval-augmented generation in enterprise use cases. Whether you are looking to optimize core Q&A, classify Not Found topics, apply Boolean Yes/No principles, delve into math equations, explore complex Q&A inquiries, or summarize core principles, this dataset provides all of these tasks in an accurate and efficient manner. Drawing on robust questions and context passages, it is a benchmark for advanced techniques across all areas of legal and financial services, giving decision-makers full insight into retrieval-augmented generation technology.
- Explore the dataset by examining the columns listed above: query, answer, sample_number and tokens; and also take a look at the category of each sample.
- Create hypotheses using a sample question from one of the categories you are interested in studying more closely. Formulate questions that relate directly to your hypothesis using more or fewer variables from this dataset as well as others you may find useful for your particular research needs.
- Take into account any limitations or assumptions that may exist related to either this set’s data or any other related external sources when crafting research questions based upon this dataset’s data schema or content: before formulating any conclusions be sure to double check your work with reliable references on hand!
- Utilize statistical analysis tools such as correlation coefficients (i.e., r), linear regression (slope/intercept), and scatter plots or other visualizations where necessary; prioritize variables from each category according to your research needs, and keep records of all evidence, including any additional external data you bring in, for future reference.
- Refine specific questions and develop an experimental setup in which promising results can be tested with improved accuracy, noting whether failures stem from trivial errors in human analytical processing or from distortion produced by outliers and variables with deflated explanatory power. Breaking each research question into smaller subtasks, gathering measurable evidence for each, and iterating on the working model, with findings linked back to the original inputs, is a practical way to keep large-scale phenomena comprehensible.
- Utilizing the tokens to create a sophisticated text-summarization network for automatic summarization of legal documents.
- Training models to recognize problems for which there may not be established answers or solutions yet, and to estimate future outcomes based on data trends and patterns with machine learning algorithms.
- Analyzing the dataset to determine keywords, common topics, or key issues related to financial and legal services that can be used in enterprise decision-making operations (a loading sketch follows this list).
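Below is a minimal loading sketch in Python, assuming the dataset ships as a single CSV with the columns named above (query, answer, sample_number, tokens, category); the file name is hypothetical:

```python
import pandas as pd

# File name is hypothetical; the columns follow the listing above.
df = pd.read_csv("rag_benchmark.csv")

# Sample counts and average token length per category, e.g. to compare
# Core Q&A items with Not Found or Boolean Yes/No items.
summary = (df.groupby("category")
             .agg(samples=("sample_number", "count"),
                  mean_tokens=("tokens", "mean")))
print(summary)
```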
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Unive...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains PDF-to-text conversions of scientific research articles, prepared for the task of data citation mining. The goal is to identify references to research datasets within full-text scientific papers and classify them as Primary (data generated in the study) or Secondary (data reused from external sources).
The PDF articles were processed using MinerU, which converts scientific PDFs into structured machine-readable formats (JSON, Markdown, images). This ensures participants can access both the raw text and layout information needed for fine-grained information extraction.
Each paper directory contains the following files:
*_origin.pdf
The original PDF file of the scientific article.
*_content_list.json
Structured extraction of the PDF content, where each object represents a text or figure element with metadata.
Example entry:
{
"type": "text",
"text": "10.1002/2017JC013030",
"text_level": 1,
"page_idx": 0
}
full.md
The complete article content in Markdown format (linearized for easier reading).
images/
Folder containing figures and extracted images from the article.
layout.json
Page layout metadata, including positions of text blocks and images.
The aim is to detect dataset references in the article text and classify them:
DOIs (Digital Object Identifiers):
https://doi.org/[prefix]/[suffix]
Example: https://doi.org/10.5061/dryad.r6nq870
Accession IDs: Used by data repositories. Format varies by repository. Examples:
GSE12345 (NCBI GEO)
PDB 1Y2T (Protein Data Bank)
E-MEXP-568 (ArrayExpress)
Each dataset mention must be labeled as Primary or Secondary (see train_labels.csv).
train_labels.csv → Ground truth with:
article_id: Research paper DOI.
dataset_id: Extracted dataset identifier.
type: Citation type (Primary / Secondary).
sample_submission.csv → Example submission format.
Paper: https://doi.org/10.1098/rspb.2016.1151
Data: https://doi.org/10.5061/dryad.6m3n9
In-text span:
"The data we used in this publication can be accessed from Dryad at doi:10.5061/dryad.6m3n9." Citation type: Primary
This dataset enables participants to develop and test NLP systems for dataset-reference detection and citation-type classification; a small identifier-matching sketch follows.
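A minimal Python sketch of the rule-based starting point suggested by the identifier formats above; the regular expressions are illustrative and deliberately loose, and real repository identifiers vary more than this:

```python
import re

# Toy text adapted from the example above, plus an invented GEO mention.
TEXT = ("The data we used in this publication can be accessed from Dryad "
        "at doi:10.5061/dryad.6m3n9; expression data are in GEO under GSE12345.")

# Illustrative patterns only; real DOIs and accession IDs are messier
# and need repository-specific rules.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s,;"]+')
ACC_RE = re.compile(r'\b(?:GSE\d+|E-MEXP-\d+|PDB\s?[0-9][A-Za-z0-9]{3})\b')

print(DOI_RE.findall(TEXT))  # ['10.5061/dryad.6m3n9']
print(ACC_RE.findall(TEXT))  # ['GSE12345']
```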
As a first step in understanding law enforcement agencies' use and knowledge of crime mapping, the Crime Mapping Research Center (CMRC) of the National Institute of Justice conducted a nationwide survey to determine which agencies were using geographic information systems (GIS), how they were using them, and, among agencies that were not using GIS, the reasons for that choice. Data were gathered using a survey instrument developed by National Institute of Justice staff, reviewed by practitioners and researchers with crime mapping knowledge, and approved by the Office of Management and Budget. The survey was mailed in March 1997 to a sample of law enforcement agencies in the United States. Surveys were accepted until May 1, 1998. Questions asked of all respondents included type of agency, population of community, number of personnel, types of crimes for which the agency kept incident-based records, types of crime analyses conducted, and whether the agency performed computerized crime mapping. Those agencies that reported using computerized crime mapping were asked which staff conducted the mapping, types of training their staff received in mapping, types of software and computers used, whether the agency used a global positioning system, types of data geocoded and mapped, types of spatial analyses performed and how often, use of hot spot analyses, how mapping results were used, how maps were maintained, whether the department kept an archive of geocoded data, what external data sources were used, whether the agency collaborated with other departments, what types of Department of Justice training would benefit the agency, what problems the agency had encountered in implementing mapping, and which external sources had funded crime mapping at the agency. Departments that reported no use of computerized crime mapping were asked why that was the case, whether they used electronic crime data, what types of software they used, and what types of Department of Justice training would benefit their agencies.
According to our latest research, the global market size for Third-Party Data Enrichment for Insurance reached USD 2.1 billion in 2024, with a robust year-on-year growth momentum. The market is expected to expand at a CAGR of 13.2% from 2025 to 2033, culminating in a projected value of USD 6.2 billion by 2033. This dynamic growth is primarily driven by the increasing need for insurance companies to enhance customer profiling, risk assessment, and fraud detection through advanced data analytics and external data sources. As per our latest research, insurers are rapidly adopting third-party data enrichment solutions to gain a competitive edge, improve operational efficiency, and deliver personalized services in a highly regulated and customer-centric environment.
A key growth factor propelling the Third-Party Data Enrichment for Insurance market is the exponential increase in the volume and variety of data available from external sources. Insurers are leveraging demographic, firmographic, technographic, and behavioral data to gain deeper insights into customer needs, preferences, and risk profiles. The integration of third-party data allows for more accurate underwriting, dynamic pricing, and targeted marketing strategies, thereby reducing loss ratios and improving profitability. Furthermore, the proliferation of digital channels and the rise of insurtech startups have intensified competition, compelling traditional insurers to invest in advanced data enrichment solutions to stay relevant and agile in a rapidly evolving marketplace.
Another significant driver is the growing prevalence of digital fraud and cyber threats, which has heightened the need for robust fraud detection and risk assessment mechanisms. Third-party data enrichment empowers insurers to validate customer identities, detect anomalies, and flag suspicious activities in real time. This capability is particularly crucial in the context of online policy issuance and claims management, where the risk of fraudulent transactions is substantially higher. Additionally, regulatory requirements such as Know Your Customer (KYC) and Anti-Money Laundering (AML) have made it imperative for insurers to access comprehensive and up-to-date external data sources to ensure compliance and mitigate financial crime risks.
The ongoing digital transformation across the insurance industry is further accelerating the adoption of third-party data enrichment solutions. As insurers transition from legacy systems to cloud-based platforms, they are increasingly seeking scalable and flexible data enrichment tools that can seamlessly integrate with their core systems. The emergence of artificial intelligence, machine learning, and big data analytics has enabled insurers to extract actionable insights from vast and disparate datasets, thereby enhancing decision-making processes across the value chain. Moreover, partnerships between insurers and data providers are fostering innovation and enabling the development of tailored solutions that address specific industry challenges and customer expectations.
Regionally, North America commands the largest share of the Third-Party Data Enrichment for Insurance market, driven by the presence of leading insurance companies, advanced IT infrastructure, and a high degree of digital adoption. Europe follows closely, with stringent regulatory frameworks and a strong focus on data privacy and security. The Asia Pacific region is witnessing the fastest growth, fueled by rising insurance penetration, rapid urbanization, and increasing investments in digital technologies. Latin America and the Middle East & Africa are also emerging as promising markets, supported by ongoing regulatory reforms and the growing adoption of insurtech solutions. Overall, the global market is characterized by intense competition, continuous innovation, and a strong emphasis on data-driven decision-making.
The Component segmen
https://www.technavio.com/content/privacy-notice
Cloud Data Warehouse Market Size 2025-2029
The cloud data warehouse market size is forecast to increase by USD 63.91 billion at a CAGR of 43.3% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing penetration of IoT-enabled devices generating vast amounts of data. This data requires efficient storage and analysis, making cloud data warehouses an attractive solution due to their scalability and flexibility. Additionally, the growing need for edge computing further fuels market expansion, as organizations seek to process data closer to its source in real-time. However, challenges persist in the form of company lock-in issues, where businesses may find it difficult to migrate their data from one cloud provider to another, potentially limiting their flexibility and strategic options.
To capitalize on market opportunities and navigate challenges effectively, companies must stay informed of emerging trends and adapt their strategies accordingly. By focusing on interoperability and data portability, they can mitigate lock-in risks and maintain agility in their data management strategies.
What will be the Size of the Cloud Data Warehouse Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
In the dynamic market, businesses seek efficient solutions for managing and analyzing their data. Data visualization tools and business intelligence platforms enable users to gain insights through interactive dashboards and reports. Data automation tools streamline data processing, while data enrichment tools enhance data quality by adding external data sources. Data virtualization tools provide a unified view of data from various sources, and data integration tools ensure seamless data flow between systems. NoSQL databases and big data platforms offer scalability and flexibility for handling large volumes of data. Data cleansing tools eliminate errors and inconsistencies, while data encryption tools secure sensitive data.
Data migration tools facilitate moving data between systems, and data validation tools ensure data accuracy. Real-time analytics platforms and predictive analytics platforms provide insights in near real-time, while prescriptive analytics platforms suggest actions based on data trends. Data deduplication tools eliminate redundant data, and data governance tools ensure compliance with regulations. Data orchestration tools manage workflows, and data science platforms facilitate machine learning and artificial intelligence applications. Data archiving tools store historical data, and data pipeline tools manage data movement between systems. Data fabric and data standardization tools ensure data consistency across the organization, while data replication tools maintain data availability and disaster recovery.
How is this Cloud Data Warehouse Industry segmented?
The cloud data warehouse industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Industry Application
Large enterprises
SMEs
Deployment
Public
Private
End-user
Cloud server provider
IT and ITES
BFSI
Retail
Others
Application
Customer analytics
Business intelligence
Data modernization
Operational analytics
Predictive analytics
Geography
North America
US
Canada
Mexico
Europe
France
Germany
Italy
UK
APAC
China
India
Japan
Rest of World (ROW)
By Industry Application Insights
The large enterprises segment is estimated to witness significant growth during the forecast period. In today's business landscape, cloud data warehouse solutions have gained significant traction among large enterprises, enabling them to efficiently manage and process data across various industries and geographies. Traditional on-premises data warehouses come with high costs due to the need for expensive hardware and physical space. Cloud-based alternatives offer a more cost-effective and convenient solution, allowing organizations to access tools and information remotely and streamline document sharing between multiple workplaces. Predictive analytics, data cost optimization, and data discovery are key drivers for cloud data warehouse adoption. These technologies offer insights into data trends and patterns, helping businesses make data-driven decisions.
Data timeliness and data standardization ar
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Algorithms proposed in computational pathology can automatically analyze digitized tissue samples in histopathological images to help diagnose diseases. Tissue samples are scanned at high resolution and usually saved as images with several magnification levels, namely whole slide images (WSIs). Convolutional neural networks (CNNs) represent the state-of-the-art computer vision methods for the analysis of histopathology images, targeting detection, classification, and segmentation. However, the development of CNNs that work with multi-scale images such as WSIs is still an open challenge. The image characteristics and the CNN properties impose architecture designs that are not trivial; therefore, single-scale CNN architectures are still often used. This paper presents Multi_Scale_Tools, a library aiming to facilitate exploiting the multi-scale structure of WSIs. Multi_Scale_Tools currently includes four components: a pre-processing component, a scale detector, a multi-scale CNN for classification, and a multi-scale CNN for segmentation. The pre-processing component includes methods to extract patches at several magnification levels. The scale detector identifies the magnification level of images that do not contain this information, such as images from the scientific literature. The multi-scale CNNs are trained by combining features and predictions that originate from different magnification levels. The components were developed using private datasets, including colon and breast cancer tissue samples, and tested on private and public external data sources, such as The Cancer Genome Atlas (TCGA). The results demonstrate the library's effectiveness and applicability. The scale detector accurately predicts multiple levels of image magnification and generalizes well to independent external data. The multi-scale CNNs outperform single-magnification CNNs for both classification and segmentation tasks. The code is developed in Python and will be made publicly available upon publication. It aims to be easy to use and easy to extend with additional functions.
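Multi_Scale_Tools' own API is not shown in this summary. As a sketch of the underlying pre-processing idea, patches can be read at several pyramid levels of a WSI with the openslide library; the path and coordinates below are placeholders, and this is not the library's actual interface:

```python
import openslide  # pip install openslide-python

# The path and coordinates are placeholders; this sketches the general
# idea of multi-scale patch extraction, not Multi_Scale_Tools' own API.
slide = openslide.OpenSlide("example_wsi.svs")

x, y, size = 10_000, 10_000, 256  # (x, y) in level-0 coordinates
for level in range(min(3, slide.level_count)):
    # read_region takes level-0 coordinates plus a pyramid level, so the
    # same (x, y) covers a wider tissue field at coarser magnifications.
    patch = slide.read_region((x, y), level, (size, size)).convert("RGB")
    downsample = slide.level_downsamples[level]
    patch.save(f"patch_level{level}_ds{downsample:.0f}.png")
```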
oslo-city-bike. License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because there is also similar data for Trondheim and Bergen.
From oslobysykkel.no:

| Variable | Format | Description |
| --- | --- | --- |
| started_at | Timestamp | Timestamp of when the trip started |
| ended_at | Timestamp | Timestamp of when the trip ended |
| duration | Integer | Duration of trip in seconds |
| start_station_id | String | Unique ID for start station |
| start_station_name | String | Name of start station |
| start_station_description | String | Description of where start station is located |
| start_station_latitude | Decimal degrees in WGS84 | Latitude of start station |
| start_station_longitude | Decimal degrees in WGS84 | Longitude of start station |
| end_station_id | String | Unique ID for end station |
| end_station_name | String | Name of end station |
| end_station_description | String | Description of where end station is located |
| end_station_latitude | Decimal degrees in WGS84 | Latitude of end station |
| end_station_longitude | Decimal degrees in WGS84 | Longitude of end station |
Please note: this data and my analysis focus on the new data format; historical data for the period April 2016 - December 2018 (Legacy Trip Data) has a different pattern. A minimal loading sketch for the new format follows.
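The sketch below uses pandas and assumes a monthly file following the naming convention above:

```python
import pandas as pd

# File name follows the oslobysykkel-YYYY-MM.csv convention above.
trips = pd.read_csv("oslobysykkel-2024-06.csv",
                    parse_dates=["started_at", "ended_at"])

# Sanity check: the recorded duration should match the timestamp difference.
delta = (trips["ended_at"] - trips["started_at"]).dt.total_seconds()
print((delta - trips["duration"]).abs().describe())

# Most-used start stations for the month.
print(trips["start_station_name"].value_counts().head(10))
```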
I myself was extremely fascinated by this open data from Oslo City Bike and, in the process of deep analysis, saw broad prospects. This interest turned into an idea to create a data-analytics problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (clustering, forecasting), as well as so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis:** https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project; stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks:
Phase 1: Cleaned Data & Core Insights; Time-Space Dynamics Exploratory
Clustering and Segmentation; Demand Forecasting (Time Series); Geospatial Analysis (Network Analysis)
Similar dataset: https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works that I have found or that have inspired me:
Exploring Open Data from Oslo City Bike (Jon Olave): visualization of popular routes and seasonality analysis.
Oslo City Bike Data Wrangling (Karl Tryggvason): predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
Helsinki City Bikes: Exploratory Data Analysis: analysis of a similar system in Helsinki, useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did it for weather data, integrating temperature, precipitation, and wind speed to explain variations in daily demand (a join sketch follows the Airbnb link below): https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that's where I took the division into neighbourhoods): https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
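A hedged sketch of the weather join, assuming a daily weather export with a date column; both file names are hypothetical:

```python
import pandas as pd

# Both file names are hypothetical; the weather export is assumed to
# carry a daily 'date' column plus temperature/precipitation/wind columns.
trips = pd.read_csv("oslobysykkel-2024-06.csv", parse_dates=["started_at"])
weather = pd.read_csv("oslo_weather_daily.csv", parse_dates=["date"])

# Aggregate trips per day, then attach the weather covariates.
daily = (trips.assign(date=trips["started_at"].dt.normalize())
              .groupby("date").size().rename("trips").reset_index())
enriched = daily.merge(weather, on="date", how="left")
print(enriched.head())
```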
oslo bike-sharing eda feature-engineering geospatial time-series
To generate a representative dataset of real-world traffic at ISCX, we defined a set of tasks, ensuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of the different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.).
We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VoIP, VPN-VoIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated:
Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing any task that includes the use of a browser. For instance, when we captured voice calls using Hangouts, even though browsing is not the main activity, we captured several browsing flows.
Email: Traffic samples were generated using a Thunderbird client and Alice's and Bob's Gmail accounts. The clients were configured to deliver mail through SMTP/S, and to receive it using POP3/SSL in one client and IMAP/SSL in the other.
Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browsers, Skype, and AIM and ICQ using an application called Pidgin [14].
Streaming: The streaming label identifies multimedia applications that require a continuous and steady stream of data. We captured traffic from YouTube (HTML5 and Flash versions) and Vimeo services using Chrome and Firefox.
File Transfer: This label identifies traffic applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.
VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.
P2P: This label is used to identify file-sharing protocols like BitTorrent. To generate this traffic we downloaded different .torrent files from a public repository and captured traffic sessions using the uTorrent and Transmission applications.
The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a client.
To facilitate the labeling process, all unnecessary services and applications were closed when capturing the traffic. (The only application executed was the objective of the capture, e.g., Skype voice call, SFTP file transfer, etc.) We used a filter to capture only the packets whose source or destination IP was the address of the local client (Alice or Bob).
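The captures themselves were made with Wireshark and tcpdump; as a hedged Python analogue of the per-client capture filter described above (the interface and client IP below are placeholders), using scapy:

```python
from scapy.all import sniff, wrpcap  # pip install scapy

CLIENT_IP = "192.168.1.10"  # placeholder for Alice's or Bob's address

# BPF filter keeping only packets to or from the local client, mirroring
# the capture setup described above (requires capture privileges).
packets = sniff(iface="eth0", filter=f"host {CLIENT_IP}", count=1000)
wrpcap("session_capture.pcap", packets)
```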
The full research paper outlining the details of the dataset and its underlying principles:
Gerard Draper Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", in Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy.
ISCXFlowMeter was written in Java to read the pcap files and create the CSV file based on selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network traffic, including full packets in pcap format and CSVs (flows generated by ISCXFlowMeter), both of which are publicly available for researchers.
For more information contact cic@unb.ca.
The UNB ISCX Network Traffic Dataset content
Traffic: Content
Web Browsing: Firefox and Chrome
Email: SMTPS, POP3S and IMAPS
Chat: ICQ, AIM, Skype, Facebook and Hangouts
Streaming: Vimeo and Youtube
File Transfer: Skype, FTPS and SFTP using Filezilla and an external service
VoIP: Facebook, Skype and Hangouts voice calls (1h duration)
P2P: uTorrent and Transmission (Bittorrent)
According to our latest research, the Data Dividend Platforms market size reached USD 2.13 billion in 2024, with a robust year-on-year growth trajectory. The market is expected to expand at a CAGR of 20.7% from 2025 to 2033, projecting a significant increase to USD 13.73 billion by 2033. This substantial growth is primarily driven by the escalating value of personal and enterprise data, the rising adoption of data monetization solutions, and increasing consumer awareness regarding the potential of leveraging their own data for economic benefits.
The rapid digitalization of economies and the proliferation of connected devices have fueled an exponential increase in data generation. This surge has highlighted the need for secure and transparent mechanisms that allow individuals and organizations to monetize their data assets. The growing demand for Data Dividend Platforms is further propelled by stringent data privacy regulations such as GDPR and CCPA, which empower users with greater control over their personal information. As these regulations become more widespread, both consumers and businesses are seeking platforms that facilitate compliant data exchange while ensuring fair compensation for data providers. This regulatory environment not only enhances trust but also incentivizes participation, thereby accelerating market growth.
Another crucial growth factor is the evolution of data-driven business models across industries. Enterprises are increasingly recognizing the value of external data sources to enhance decision-making, personalize customer experiences, and drive innovation. Data Dividend Platforms enable seamless and secure transactions between data owners and buyers, fostering a transparent ecosystem that benefits all stakeholders. The integration of advanced technologies such as blockchain and artificial intelligence further strengthens these platforms by enhancing data security, automating transactions, and ensuring the authenticity of data exchanges. This technological advancement is a key enabler of market expansion, as it addresses long-standing challenges related to data privacy, ownership, and compensation.
Additionally, the rise of the gig economy and the empowerment of individuals as data creators have created new opportunities for personal data monetization. Consumers are becoming more aware of the value of their digital footprints and are increasingly seeking ways to monetize their data assets. Data Dividend Platforms cater to this demand by providing user-friendly interfaces, transparent revenue-sharing models, and robust privacy controls. This shift towards individual empowerment is expected to drive significant market growth, particularly as digital literacy improves and more people become comfortable with managing and monetizing their personal data.
From a regional perspective, North America currently dominates the Data Dividend Platforms market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The strong presence of technology giants, early adoption of data monetization models, and a favorable regulatory landscape contribute to North America's leadership. Meanwhile, Asia Pacific is anticipated to witness the highest CAGR over the forecast period, driven by rapid digital transformation, increasing internet penetration, and growing awareness of data rights among consumers and enterprises. Europe remains a key market due to stringent data protection regulations and a mature digital ecosystem, while Latin America and the Middle East & Africa are gradually emerging as promising markets due to ongoing digitalization initiatives and increasing investment in data infrastructure.
The Component segment of the Data Dividend Platforms market is bifurcated into Software and Services. Software solutions form the backbone of these platforms, providing the essential infrastructure for data collection, processing, exchange, and compensation. Robust software
Our People data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.
Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.
People Data Schema & Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU & monthly location pings.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).
People Data Use Cases:
360-Degree Customer View: Get a comprehensive image of customers by means of internal and external data aggregation.
Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting.
Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity.
Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.
Here's the schema of People Data:
person_id
first_name
last_name
age
gender
linkedin_url
twitter_url
facebook_url
city
state
address
zip
zip4
country
delivery_point_bar_code
carrier_route
walk_sequence_code
fips_state_code
fips_county_code
country_name
latitude
longitude
address_type
metropolitan_statistical_area
core_based_statistical_area
census_tract
census_block_group
census_block
primary_address
pre_address
street
post_address
address_suffix
address_secondline
address_abrev
census_median_home_value
home_market_value
property_build_year
property_with_ac
property_with_pool
property_with_water
property_with_sewer
general_home_value
property_fuel_type
year
month
household_id
census_median_household_income
household_size
marital_status
length_of_residence
number_of_kids
pre_school_kids
single_parents
working_women_in_house_hold
homeowner
children
adults
generations
net_worth
education_level
occupation
education_history
credit_lines
credit_card_user
newly_issued_credit_card_user
credit_range_new
credit_cards
loan_to_value
mortgage_loan2_amount
mortgage_loan_type
mortgage_loan2_type
mortgage_lender_code
mortgage_loan2_lender_code
mortgage_lender
mortgage_loan2_lender
mortgage_loan2_ratetype
mortgage_rate
mortgage_loan2_rate
donor
investor
interest
buyer
hobby
personal_email
work_email
devices
phone
employee_title
employee_department
employee_job_function
skills
recent_job_change
company_id
company_name
company_description
technologies_used
office_address
office_city
office_country
office_state
office_zip5
office_zip4
office_carrier_route
office_latitude
office_longitude
office_cbsa_code
office_census_block_group
office_census_tract
office_county_code
company_phone
company_credit_score
company_csa_code
company_dpbc
company_franchiseflag
company_facebookurl
company_linkedinurl
company_twitterurl
company_website
company_fortune_rank
company_government_type
company_headquarters_branch
company_home_business
company_industry
company_num_pcs_used
company_num_employees
company_firm_individual
company_msa
company_msa_name
company_naics_code
company_naics_description
company_naics_code2
company_naics_description2
company_sic_code2
company_sic_code2_description
company_sic_code4
company_sic_code4_description
company_sic_code6
company_sic_code6_description
company_sic_code8
company_sic_code8_description
company_parent_company
company_parent_company_location
company_public_private
company_subsidiary_company
company_residential_business_code
company_revenue_at_side_code
company_revenue_range
company_revenue
company_sales_volume
company_small_business
company_stock_ticker
company_year_founded
company_minorityowned
company_female_owned_or_operated
company_franchise_code
company_dma
company_dma_name
company_hq_address
company_hq_city
company_hq_duns
company_hq_state
company_hq_zip5
company_hq_zip4
company_se...
This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data type, description of project, link to the DMP, and, where possible, external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
According to our latest research, the global insurance third-party data enrichment market size reached USD 2.56 billion in 2024, demonstrating the sector’s robust expansion fueled by the increasing demand for advanced analytics in the insurance industry. With a compelling compound annual growth rate (CAGR) of 13.4% projected for the forecast period, the market is expected to achieve a value of USD 7.87 billion by 2033. The primary growth factor driving this market is the insurance sector’s accelerating shift towards data-driven decision-making, leveraging third-party data to enhance risk assessment, streamline claims management, and personalize customer experiences.
The surge in digital transformation initiatives across the insurance industry is a pivotal growth catalyst for the insurance third-party data enrichment market. Insurers are increasingly seeking ways to differentiate their offerings and improve operational efficiencies in a highly competitive landscape. By integrating external data sources—such as demographic, behavioral, and technographic data—insurers gain deeper insights into customer needs, risk profiles, and emerging market trends. This enables more accurate underwriting, proactive fraud detection, and tailored product recommendations, which collectively boost customer satisfaction and retention rates. Furthermore, the proliferation of connected devices, IoT, and big data analytics platforms is expanding the pool of actionable data, empowering insurers to make more informed decisions across the value chain.
Another significant growth factor is the rising incidence of insurance fraud and the corresponding need for robust fraud detection mechanisms. Third-party data enrichment solutions empower insurers to cross-verify applicant information, identify anomalies, and flag suspicious activities in real-time. Advanced machine learning algorithms and AI-powered analytics are increasingly being integrated into these solutions, enhancing their ability to detect complex fraud patterns that traditional methods may overlook. As regulatory scrutiny intensifies and insurers face mounting pressure to minimize losses, investment in sophisticated data enrichment tools is becoming indispensable for maintaining profitability and compliance.
The evolving regulatory landscape is also shaping market growth, as insurers must navigate a complex web of data privacy laws and compliance requirements. The adoption of third-party data enrichment solutions facilitates adherence to these regulations by ensuring data accuracy, enhancing transparency, and supporting robust audit trails. In addition, partnerships between insurers and data providers are fostering the development of innovative enrichment solutions tailored to specific insurance segments such as life, health, and property & casualty insurance. These collaborations are accelerating the adoption of enriched data across diverse applications, further propelling market expansion.
From a regional perspective, North America continues to dominate the insurance third-party data enrichment market, accounting for the largest revenue share in 2024, driven by the presence of leading insurance providers, advanced data infrastructure, and a strong regulatory framework. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid digitalization, increasing insurance penetration, and a burgeoning middle class. Meanwhile, Europe is witnessing steady growth, supported by stringent regulatory mandates and a mature insurance ecosystem. Latin America and the Middle East & Africa are also experiencing gradual adoption, with insurers in these regions increasingly recognizing the value of third-party data enrichment to enhance competitiveness and operational efficiency.
The insurance third-party data enrichment market is segmented by component into solutions and services, each playing a c
https://creativecommons.org/publicdomain/zero/1.0/
Business roles at AgroStar require a baseline of analytical skills, and it is also critical that we are able to explain complex concepts in a simple way to a variety of audiences. This test is structured so that someone with the baseline skills needed to succeed in the role should be able to complete this in under 4 hours without assistance.
Use the data in the included sheet to address the following scenario...
Since its inception, AgroStar has been leveraging an assisted-marketplace model. Given that the market potential is huge and that the target customer appreciates having a physical store nearby, we have decided to explore the offline retail model to drive growth. The primary objective is to capture a larger wallet share for AgroStar among existing customers.
Assume you are back in time, in August 2018, and you have been asked to determine the location (taluka) of the first AgroStar offline retail store.
1. What are the key factors you would use to determine the location? Why?
2. Which taluka (across the three states) would you look to open in? Why?
-- (1) Please mention any assumptions you have made and the underlying thought process
-- (2) Please treat the assignment as standalone (it should be self-explanatory to someone who reads it), but we will have a follow-up discussion with you in which we will walk through your approach to this assignment.
-- (3) Mention any data that may be missing that would make this study more meaningful
-- (4) Kindly conduct your analysis within the spreadsheet; we would like to see the working sheet. If you face any issues due to the file size, kindly download the file and share an Excel sheet with us
-- (5) If you would like to append a Word document/presentation to summarize, please go ahead.
-- (6) In case you use any external data source/article, kindly share the source.
The file CDNOW_master.txt contains the entire purchase history up to the end of June 1998 of the cohort of 23,570 individuals who made their first-ever purchase at CDNOW in the first quarter of 1997. This CDNOW dataset was first used by Fader and Hardie (2001).
Each record in this file, 69,659 in total, comprises four fields: the customer's ID, the date of the transaction, the number of CDs purchased, and the dollar value of the transaction.
CustID = CDNOW_master(:,1); % customer id
Date   = CDNOW_master(:,2); % transaction date
Quant  = CDNOW_master(:,3); % number of CDs purchased
Spend  = CDNOW_master(:,4); % dollar value (excl. S&H)
See "Notes on the CDNOW Master Data Set" (http://brucehardie.com/notes/026/) for details of how the 1/10th systematic sample (http://brucehardie.com/datasets/CDNOW_sample.zip) used in many papers was created.
Reference:
Fader, Peter S. and Bruce G. S. Hardie (2001), "Forecasting Repeat Sales at CDNOW: A Case Study," Interfaces, 31 (May-June), Part 2 of 2, S94-S107.
I have merged all three datasets into one file and also performed some feature engineering.
Available Data: You will be given anonymized user gameplay data in the form of three CSV files.
Fields in the data are as described below:
Gameplay_Data.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Eventtime: DateTime on which user played the tournament
* Entry_Fee: Entry Fee of tournament
* Win_Loss: ‘W’ if the user won that particular tournament, ‘L’ otherwise
* Winnings: How much money the user won in the tournament (0 for ‘L’)
* Tournament_Type: Type of tournament user played (A / B / C / D)
* Num_Players: Number of players that played in this tournament
Wallet_Balance.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Timestamp: DateTime at which user's wallet balance is given
* Wallet_Balance: User's wallet balance at the given timestamp
Demographic.csv contains the following fields:
* Uid: Alphanumeric unique Id assigned to user
* Installed_At: Timestamp at which user installed the app
* Connection_Type: User's internet connection type (e.g., Cellular / Dial Up)
* Cpu_Type: CPU type of the device the user is playing on
* Network_Type: Network type in encoded form
* Device_Manufacturer: e.g., Realme
* ISP: Internet Service Provider, e.g., Airtel
* Country
* Country_Subdivision
* City
* Postal_Code
* Language: Language that the user has selected for gameplay
* Device_Name
* Device_Type
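To make the file layout concrete, here is a minimal loading-and-merging sketch in Python/pandas; the file names match those above, but the parsing options and the latest-balance simplification are assumptions, not part of the assignment.

import pandas as pd

# Load the three files described above.
gameplay = pd.read_csv("Gameplay_Data.csv", parse_dates=["Eventtime"])
wallet = pd.read_csv("Wallet_Balance.csv", parse_dates=["Timestamp"])
demo = pd.read_csv("Demographic.csv")

# One row per tournament played, enriched with static user attributes.
merged = gameplay.merge(demo, on="Uid", how="left")

# Attach each user's most recent wallet balance (a simplification; aligning
# balances to each Eventtime would be more faithful but is omitted here).
latest = (wallet.sort_values("Timestamp")
                .groupby("Uid")
                .tail(1)[["Uid", "Wallet_Balance"]])
merged = merged.merge(latest, on="Uid", how="left")

print(merged.shape)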
Build a basic recommendation system that can rank/recommend relevant tournaments and entry fees to the user. The main objectives are:
1. A user should not have to scroll too much before selecting a tournament of their preference.
2. We would like the user to play as high an entry-fee tournament as possible.
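As one possible starting point (a hypothetical baseline, not the expected solution), tournaments can be scored per user by combining the global popularity of each (Tournament_Type, Entry_Fee) combination with the user's historical entry-fee affinity, mildly boosted toward higher fees to serve objective 2; the weighting below is illustrative.

import pandas as pd

def rank_tournaments(gameplay: pd.DataFrame, uid: str, top_k: int = 5) -> pd.DataFrame:
    # Global popularity of each tournament-type / entry-fee combination.
    pop = (gameplay.groupby(["Tournament_Type", "Entry_Fee"])
                   .size().rename("plays").reset_index())
    pop["popularity"] = pop["plays"] / pop["plays"].sum()

    # User's typical entry fee, nudged 10% upward (an illustrative choice) so
    # slightly higher-fee tournaments rank well without alienating the user.
    user = gameplay[gameplay["Uid"] == uid]
    target_fee = 1.1 * (user["Entry_Fee"].mean() if len(user)
                        else gameplay["Entry_Fee"].median())
    pop["fee_score"] = 1.0 / (1.0 + (pop["Entry_Fee"] - target_fee).abs())

    pop["score"] = pop["popularity"] * pop["fee_score"]
    return pop.sort_values("score", ascending=False).head(top_k)

Serving the top of this ranked list first addresses objective 1 (less scrolling), while the upward fee nudge addresses objective 2.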