Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
LinkedIn is the world’s preeminent social network for professionals. Members create CVs, list their current and previous job roles, skills and education. The business network is also a recruiting...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context: Professional networking is inefficient - 90% of LinkedIn connections provide minimal mutual value. This dataset enables AI models to predict networking compatibility and recommend high-value connections before they're made.
Dataset Overview: 50,000 professional profiles paired into 500,000+ compatibility-scored combinations with detailed feature breakdowns. Synthetically generated, ML-ready data for building recommendation systems.
Files & Schema:
- profiles.csv (50,000 rows)
  - profile_id - Unique identifier
  - name, email, location - Demographics
  - current_role, current_company - Current position
  - industry - Industry category
  - years_experience - Total years of experience
  - seniority_level - entry/mid/senior/executive
  - skills - JSON array of skills
  - experience - JSON work history
  - education - JSON education history
  - connections - Network size
  - goals, needs, can_offer - Professional objectives (JSON)
- compatibility_pairs.csv (500,000+ rows)
  - pair_id - Unique pair identifier
  - profile_a_id, profile_b_id - Profile IDs
  - compatibility_score - Overall match (0-100)
  - skill_match_score - Skill overlap
  - skill_complementarity_score - How skills complement
  - network_value_a_to_b - Network value A→B provides
  - network_value_b_to_a - Network value B→A provides
  - career_alignment_score - Mentorship/learning potential
  - experience_gap - Years experience difference
  - industry_match - Industry similarity
  - geographic_score - Location proximity
  - seniority_match - Seniority compatibility
  - mutual_benefit_explanation - Human-readable reasoning
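A minimal sketch of how the two files might be read together, assuming the JSON columns such as `skills` are stored as JSON strings inside the CSV. The sample rows, the `min_score` threshold and the helper names `load_profiles` and `top_pairs` are made up for illustration and are not part of the dataset.

```python
import csv
import io
import json

# Made-up sample rows that follow the schema above (not real dataset content).
PROFILES_CSV = '''profile_id,name,seniority_level,skills
p1,Ada,senior,"[""python"", ""sql""]"
p2,Lin,entry,"[""sql"", ""design""]"
'''
PAIRS_CSV = '''pair_id,profile_a_id,profile_b_id,compatibility_score
x1,p1,p2,72.5
'''


def load_profiles(text):
    """Index profiles by profile_id and decode the JSON-encoded skills column."""
    profiles = {}
    for row in csv.DictReader(io.StringIO(text)):
        row["skills"] = json.loads(row["skills"])
        profiles[row["profile_id"]] = row
    return profiles


def top_pairs(pairs_text, profiles, min_score=70.0):
    """Return (name_a, name_b, score, shared_skills) for pairs at or above min_score."""
    results = []
    for row in csv.DictReader(io.StringIO(pairs_text)):
        score = float(row["compatibility_score"])
        if score < min_score:  # hypothetical cut-off, not defined by the dataset
            continue
        a = profiles[row["profile_a_id"]]
        b = profiles[row["profile_b_id"]]
        shared = sorted(set(a["skills"]) & set(b["skills"]))
        results.append((a["name"], b["name"], score, shared))
    return results


profiles = load_profiles(PROFILES_CSV)
matches = top_pairs(PAIRS_CSV, profiles)
```

With the real files, the in-memory strings would be replaced by the opened `profiles.csv` and `compatibility_pairs.csv`; indexing profiles by `profile_id` first keeps the pair lookup cheap even for 500,000+ rows.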
The number of LinkedIn users in the United Kingdom was forecast to increase continuously between 2024 and 2028 by a total of 1.5 million users (+4.51 percent). After the eighth consecutive year of growth, the LinkedIn user base is estimated to reach 34.7 million users, a new peak, in 2028. User figures, shown here for the platform LinkedIn, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
https://creativecommons.org/publicdomain/zero/1.0/
Social media has taken over much of the Internet. People use it to get the latest news, find useful resources, even meet life partners. Because social media plays a big role in delivering news, news that affects our sentiments tends to spread like wildfire. Based on the headline and the title, the given date and the social media platforms, you have to predict how each item has affected the human sentiment scores. The columns to predict are “SentimentTitle” and “SentimentHeadline”.
This is a subset of the dataset of the same name available in the UCI Machine Learning Repository. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine.
The attributes of the dataset are:
- IDLink (numeric): Unique identifier of news items
- Title (string): Title of the news item according to the official media sources
- Headline (string): Headline of the news item according to the official media sources
- Source (string): Original news outlet that published the news item
- Topic (string): Query topic used to obtain the items in the official media sources
- Publish-Date (timestamp): Date and time of the news items' publication
- Facebook (numeric): Final value of the news items' popularity according to the social media source Facebook
- Google-Plus (numeric): Final value of the news items' popularity according to the social media source Google+
- LinkedIn (numeric): Final value of the news items' popularity according to the social media source LinkedIn
- SentimentTitle: Sentiment score of the title. The higher the score, the more positive the sentiment, and vice versa. (Target Variable 1)
- SentimentHeadline: Sentiment score of the text in the news items' headline. The higher the score, the more positive the sentiment. (Target Variable 2)
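The prediction task described above can be sketched as follows: separate the two target columns from the remaining attributes and compute a simple mean-prediction baseline to compare later models against. The sample rows are made up, and the choice of mean absolute error as the metric is an assumption; the description does not name an evaluation metric.

```python
import csv
import io

# Made-up rows following the attribute list above (not real dataset content).
SAMPLE = '''IDLink,Title,Headline,Topic,Facebook,SentimentTitle,SentimentHeadline
1,Title a,Headline a,economy,10,0.10,-0.20
2,Title b,Headline b,obama,3,-0.30,0.00
3,Title c,Headline c,economy,7,0.20,0.10
'''

TARGETS = ("SentimentTitle", "SentimentHeadline")


def split_features_targets(text):
    """Separate each row into input features and the two target scores."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(text)):
        targets.append(tuple(float(row.pop(name)) for name in TARGETS))
        features.append(row)
    return features, targets


def mean_baseline_mae(targets):
    """Mean absolute error of predicting each target's mean for every row."""
    n = len(targets)
    maes = []
    for col in range(len(TARGETS)):
        values = [t[col] for t in targets]
        mean = sum(values) / n
        maes.append(sum(abs(v - mean) for v in values) / n)
    return maes


features, targets = split_features_targets(SAMPLE)
baseline = mean_baseline_mae(targets)
```

Any model trained on the title, headline, date and popularity columns should at least beat this constant-prediction baseline on both targets.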
To promote progress in catalysis-related research, sharing FAIR (findable, accessible, interoperable, reusable) data is essential. Shared data can inspire new understanding, prevent duplication of work and even allow for new insights through artificial intelligence-based approaches. While not all data in catalysis-related research can be shared openly, it is important to make sure that the data can be understood independently of the individual research project so that it retains its long-term value. This requires a comprehensive description with metadata, but technical aspects such as data formats also need to be kept in mind. Join us on an excursion into making data FAIR and discover the first steps to ensure that your research data will retain its value. Stay tuned for more exciting content, and thank you for being a part of our growing community!
Check out our website: https://nfdi4cat.org/
Follow us: https://in.linkedin.com/company/nfdi4cat https://twitter.com/NFDI4Cat
This dataset contains statistics on the world's billionaires, including information about their businesses, industries, and personal details. It provides insights into the wealth distribution, business sectors, and demographics of billionaires worldwide.
- rank: The ranking of the billionaire in terms of wealth.
- finalWorth: The final net worth of the billionaire in U.S. dollars.
- category: The category or industry in which the billionaire's business operates.
- personName: The full name of the billionaire.
- age: The age of the billionaire.
- country: The country in which the billionaire resides.
- city: The city in which the billionaire resides.
- source: The source of the billionaire's wealth.
- industries: The industries associated with the billionaire's business interests.
- countryOfCitizenship: The country of citizenship of the billionaire.
- organization: The name of the organization or company associated with the billionaire.
- selfMade: Indicates whether the billionaire is self-made (True/False).
- status: "D" represents self-made billionaires (Founders/Entrepreneurs) and "U" indicates inherited or unearned wealth.
- gender: The gender of the billionaire.
- birthDate: The birthdate of the billionaire.
- lastName: The last name of the billionaire.
- firstName: The first name of the billionaire.
- title: The title or honorific of the billionaire.
- date: The date of data collection.
- state: The state in which the billionaire resides.
- residenceStateRegion: The region or state of residence of the billionaire.
- birthYear: The birth year of the billionaire.
- birthMonth: The birth month of the billionaire.
- birthDay: The birth day of the billionaire.
- cpi_country: Consumer Price Index (CPI) for the billionaire's country.
- cpi_change_country: CPI change for the billionaire's country.
- gdp_country: Gross Domestic Product (GDP) for the billionaire's country.
- gross_tertiary_education_enrollment: Enrollment in tertiary education in the billionaire's country.
- gross_primary_education_enrollment_country: Enrollment in primary education in the billionaire's country.
- life_expectancy_country: Life expectancy in the billionaire's country.
- tax_revenue_country_country: Tax revenue in the billionaire's country.
- total_tax_rate_country: Total tax rate in the billionaire's country.
- population_country: Population of the billionaire's country.
- latitude_country: Latitude coordinate of the billionaire's country.
- longitude_country: Longitude coordinate of the billionaire's country.
- Wealth distribution analysis: Explore the distribution of billionaires' wealth across different industries, countries, and regions.
- Demographic analysis: Investigate the age, gender, and birthplace demographics of billionaires.
- Self-made vs. inherited wealth: Analyze the proportion of self-made billionaires and those who inherited their wealth.
- Economic indicators: Study correlations between billionaire wealth and economic indicators such as GDP, CPI, and tax rates.
- Geospatial analysis: Visualize the geographical distribution of billionaires and their wealth on a map.
- Trends over time: Track changes in billionaire demographics and wealth over the years.
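As an illustration of the self-made vs. inherited analysis, the sketch below computes the share of self-made billionaires per country. It assumes `selfMade` is stored as the text `True`/`False`, as the column description suggests; the sample rows and the helper name `self_made_share_by_country` are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using column names from the list above (not real dataset content).
SAMPLE = '''personName,country,finalWorth,selfMade
Person A,Country X,1000,True
Person B,Country X,3000,False
Person C,Country Y,2000,True
'''


def self_made_share_by_country(text):
    """Return {country: fraction of listed billionaires flagged selfMade}."""
    totals = defaultdict(int)
    self_made = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["country"]] += 1
        # Assumption: the flag is serialized as the text "True" or "False".
        if row["selfMade"].strip().lower() == "true":
            self_made[row["country"]] += 1
    return {country: self_made[country] / totals[country] for country in totals}


shares = self_made_share_by_country(SAMPLE)
```

The same grouping pattern works for `category` or `gender` by swapping the key column.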
If this was helpful, a vote is appreciated 🙂❤️
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains information prepared for a forthcoming publication:
- extracted from public open data accessible via the web (see "Production Notes" section at the end for details)
- overall aim: comparing company data pre- and post-COVID, i.e. the evolution from 2019 to 2022 (balance sheet due July 2023)
As the project progresses, more material will be added both to this dataset and to the dedicated GitHub repository.
On 2025-06-02, as part of a side project related to the same data source, a new script and associated list focused on the companies within the MIB40 index were added, derived from the scripts created previously to retrieve YahooFinance data.
Please refer to GitHub for more information, and to access the CSV and associated Jupyter Notebook.
General description: see linkedin post
Rationale of dataset and the associated project: Reading pre- and post-COVID corporate narratives, the Italian case: a dataset in fieri
See associated notebook (more charts will be added as further information is integrated)
The first file contained in this dataset is the list of stocks and warrants presented on the website of Borsa Italiana as of 2023-07-11, with the following structure (structure last updated on 2025-11-26, see notes below):
| column | name | datatype | description |
|---|---|---|---|
| 1 | # | numeric | position index |
| 2 | stock | text | name of the company, as per Borsa Italiana website |
| 3 | link | URL | URL link to the page |
| 4 | market | text | subsection of the "listino", as per Borsa Italiana website |
| 5 | ISIN | text | stock identification code: a 2-letter country code followed by 10 alphanumeric characters, the last of which is a check digit |
| 6 | profile | URL | URL link to the profile page for the stock (if filled by the company) |
| 7 | detailspresent | char | Y=if a page with details was linked, N=details page not present |
| 8 | withinstudy | string | only for ISINs starting with IT where there was a value within the profile URL: blank if retained within the study, "MissingReports" if financial reports are partial or not available, "NotCoveringPeriod" if some financial reports 2019-2021 are missing |
| 9 | covidstudy | string | within those selected in column 8, further restricted, based on data available, companies for a study comparing pre- and post-Covid financial and operational information; values: Y = within the study / N = excluded due to data / outofscope = not within the scope |
| 10 | industry | string | na = not available: if a value is present = as listed by industry on BorsaItaliana.it |
| 11 | subindustry | string | na = not available: if a value is present = as listed by subindustry within the industry on BorsaItaliana.it |
| 12 | 2019accounts | string | languages of the 2019 accounts for companies whose "covidstudy" (column 9) is "Y"; if both English and Italian are available, EN is listed |
| 13 | 2021accounts | string | languages of the 2021 accounts for companies whose "covidstudy" (column 9) is "Y"; if both English and Italian are available, EN is listed |
| 14 | UsedforENG | string | string: Y if used for the text-based part of the study, i.e. those that have EN in both "2019accounts" and "2021accounts" |
| 15 | YahooFinanceURL | URL | using the ISIN as main point of reference, the link to the YahooFinance page presenting financials; where none was available, "na" |
| 16 | checkvs2021yahoo | string | included=data reconciliation successful and company included in sample; bankassfin=company excluded but included in future study on bank/assurance/finance; excluded=company excluded for other reasons |
| 17 | MIB40 | string | string: Y if within the MIB40 Index; otherwise null |
Note:
* this table is kept as a CSV source, which was built on 2023-07-12 using the information extracted on 2023-07-11 from the Borsa Italiana website (specifically, the 30 "listino A-Z" pages available)
* only the latest version of this dataset is visible
* it was updated on 2023-08-04, adding column 8 ("withinstudy") after retrieving the financial reports for all the companies on Borsa Milano that fulfill the condition described in the table above for column 8
* it was updated on 2023-09-03, adding column 9 ("covidstudy") after identifying which companies are part of the study (i.e. besides the other conditions, annual reports for 2019 and 2021 are available)
* it was updated on 2023-11-02, adding column 10 ("industry") and 11 ("subi...
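Since the ISIN in column 5 is the main point of reference for joining with other sources, a quick validity check before joining can be useful. The sketch below is not part of the dataset's scripts; it checks the standard ISIN layout (2-letter country code, 9 alphanumeric characters, 1 check digit) and verifies the check digit with the Luhn algorithm. The helper names `isin_is_valid` and `is_italian` are hypothetical.

```python
import re

# Two letters, nine alphanumeric characters, one numeric check digit.
ISIN_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")


def isin_is_valid(isin):
    """Check the ISIN layout and its Luhn check digit."""
    if not ISIN_PATTERN.fullmatch(isin):
        return False
    # Letters expand to two digits (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right, doubling every second digit (Luhn algorithm).
    for index, ch in enumerate(reversed(digits)):
        value = int(ch)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_italian(isin):
    """True for ISINs with the IT country prefix, the condition used for column 8."""
    return isin.startswith("IT")


ok = isin_is_valid("US0378331005")       # a commonly cited example ISIN
broken = isin_is_valid("US0378331006")   # same code with a wrong check digit
```

A check like this catches copy-and-paste damage in the ISIN column, but it cannot tell whether a well-formed ISIN is actually listed on Borsa Italiana.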
The number of Pinterest users in the United Kingdom was forecast to increase continuously between 2024 and 2028 by a total of 0.3 million users (+3.14 percent). After the ninth consecutive year of growth, the Pinterest user base is estimated to reach 9.88 million users, a new peak, in 2028. Notably, the number of Pinterest users has increased continuously over the past years. User figures, shown here for the platform Pinterest, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
The number of Instagram users in the United Kingdom was forecast to increase continuously between 2024 and 2028 by a total of 2.1 million users (+7.02 percent). After the ninth consecutive year of growth, the Instagram user base is estimated to reach 32 million users, a new peak, in 2028. Notably, the number of Instagram users has increased continuously over the past years. User figures, shown here for the platform Instagram, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
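The quoted Instagram figures can be cross-checked with a little arithmetic. The sketch below assumes the total increase is measured from the 2024 value to the 2028 peak and that the published numbers are rounded; the function names are illustrative only.

```python
def implied_baseline(peak_millions, increase_millions):
    """Baseline user count implied by the forecast peak and the total increase."""
    return peak_millions - increase_millions


def percent_change(baseline_millions, increase_millions):
    """Relative growth over the baseline, in percent."""
    return 100.0 * increase_millions / baseline_millions


# Instagram figures quoted above: 32 million peak, 2.1 million total increase.
baseline_2024 = implied_baseline(32.0, 2.1)
growth = round(percent_change(baseline_2024, 2.1), 2)
```

The implied 2024 baseline is about 29.9 million users, and the resulting growth rate matches the quoted +7.02 percent. Because the inputs are rounded, the same check on other forecasts may differ from the published percentage in the last digit.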
Project Status: Proof-of-Concept (POC) - Capstone Project
This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.
The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.
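The project description mentions combining model anomaly counts (`Model_Anomaly_Count`) and HRF counts (`HRF_Count`) into a `Priority_Tier`. The sketch below shows one way such a combination could look. The thresholds, the tier labels and the function name `priority_tier` are assumptions made for illustration, not the project's actual scoring rules.

```python
def priority_tier(model_anomaly_count, hrf_count):
    """Map the two flag counts to a tier label (thresholds are hypothetical)."""
    if model_anomaly_count < 0 or hrf_count < 0:
        raise ValueError("counts must be non-negative")
    if model_anomaly_count >= 1 and hrf_count >= 1:
        return "High"    # flagged by both an ML model and a business rule
    if model_anomaly_count >= 2 or hrf_count >= 2:
        return "Medium"  # repeated flags from one side only
    if model_anomaly_count or hrf_count:
        return "Low"     # a single flag from one side
    return "None"


# Made-up documents for illustration.
documents = [
    {"doc": "A", "Model_Anomaly_Count": 2, "HRF_Count": 1},
    {"doc": "B", "Model_Anomaly_Count": 0, "HRF_Count": 2},
    {"doc": "C", "Model_Anomaly_Count": 1, "HRF_Count": 0},
    {"doc": "D", "Model_Anomaly_Count": 0, "HRF_Count": 0},
]
for document in documents:
    document["Priority_Tier"] = priority_tier(
        document["Model_Anomaly_Count"], document["HRF_Count"]
    )
```

Ranking agreement between the unsupervised models and the expert rules highest reflects the hybrid idea: a posting that looks unusual statistically and also violates a rule is the most likely to merit investigation.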
This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.
Author: Anitha R (https://www.linkedin.com/in/anithaswamy)
Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description); no description available.
Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:
* Inaccurate financial reporting
* Significant reconciliation efforts
* Potential audit failures or compliance issues
* Masking of operational errors or fraud
Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.
* [...] FAGLFLEXA for reliability.
* [...] FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge.
* [...] Model_Anomaly_Count) and HRF counts (HRF_Count) into a Priority_Tier for focusing investigation efforts.
* [...] Review_Focus text description summarizing why an item was flagged.

The project followed a structured approach:
* [...] BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
* [...] sap_engineered_features.csv.

(For detailed methodology, please refer to the Comprehensive_Project_Report.pdf in the /docs folder - if you include it).
Libraries:
joblib==1.4.2
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Batting_PerMatchData_T20*.csv - The dataset contains information about the performance of each batsman in each inning of the match.
Bowling_PerMatchData_T20*.csv - The dataset contains information about the performance of each bowler in each inning of the match.
Summary_T20*.csv - The dataset contains information about every T20 cricket match played.
match: This column represents the names of the teams playing the match.
teamInnings: This column represents the team batting in the match.
battingPos: This column represents the batting position of the batsman in the innings. The batsman at the top of the order usually has a lower batting position, and the batsman at the bottom of the order has a higher batting position.
batsmanName: This column represents the name of the batsman who is currently batting in the match.
runs: This column represents the number of runs scored by the batsman in the current innings.
balls: This column represents the number of balls faced by the batsman in the current innings.
4s: This column represents the number of boundaries hit by the batsman that have crossed the boundary rope and scored four runs.
6s: This column represents the number of sixes hit by the batsman.
SR: This column represents the batting strike rate of the batsman in the current innings. It is calculated as the number of runs scored by the batsman per 100 balls faced.
out/not_out: This column represents whether the batsman is out or not out. If the batsman is not out at the end of the innings, the value in this column would be "not out" else "out".
match_id: This column represents the unique identifier of the cricket match being played, which may be used to join this table with other tables containing additional information about the match.
match: This column represents the names of the teams playing the match.
bowlingTeam: This column represents the team that is currently bowling in the match.
bowlerName: This column represents the name of the bowler who is currently bowling in the match.
overs: This column represents the number of overs bowled by the bowler in the match. One over consists of six legal deliveries (excluding wides and no-balls).
maiden: This column represents the number of maiden overs bowled by the bowler.
runs: This column represents the total number of runs conceded by the bowler in the match.
wickets: This column represents the total number of wickets taken by the bowler in the match.
economy: This column represents the economy rate of the bowler in the match. It is calculated as the average number of runs conceded per over.
0s: This column represents the number of dot balls bowled by the bowler in the match.
4s: This column represents the number of boundaries hit by the batsman off the bowler that have crossed the boundary rope and scored four runs.
6s: This column represents the number of sixes hit by the batsman off the bowler.
wides: This column represents the number of deliveries bowled outside the batsman's reach and judged too wide for the batsman to play.
noBalls: This column represents the number of deliveries bowled by the bowler that are illegal for some reason, such as the bowler overstepping the crease, throwing the ball rather than bowling it, or bowling a bouncer that goes above the batsman's head.
match_id: This column represents the unique identifier of the cricket match being played.
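The `SR` and `economy` definitions above can be reproduced from the raw columns. The sketch below assumes the `overs` column uses the usual cricket notation in which `3.4` means 3 overs and 4 balls; the dataset description does not confirm this. The function names are illustrative.

```python
def strike_rate(runs, balls):
    """Runs scored per 100 balls faced; None when no ball was faced."""
    if balls == 0:
        return None
    return 100.0 * runs / balls


def overs_to_balls(overs):
    """Convert overs notation ('3.4' = 3 overs and 4 balls) to legal deliveries."""
    whole, _, part = str(overs).partition(".")
    extra = int(part) if part else 0
    if extra > 5:
        raise ValueError("at most 5 balls can follow the decimal point")
    return int(whole) * 6 + extra


def economy(runs_conceded, overs):
    """Average runs conceded per six-ball over; None when nothing was bowled."""
    balls = overs_to_balls(overs)
    if balls == 0:
        return None
    return runs_conceded * 6.0 / balls
```

Converting overs to balls first matters because `3.3` overs is three and a half overs, not 3.3; dividing runs by the raw decimal would overstate the bowler's economy rate.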
Team 1: This column represents one of the teams playing in the cricket match.
Team 2: This column represents the other team playing in the cricket match.
Winner: This column represents the winning team of the cricket match. It could be either Team 1 or Team 2.
Margin: This column represents the margin of victory for the winning team. It could be represented in terms of runs, wickets or balls remaining ...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset represents a comprehensive collection of vehicle listings from PakWheels.com, Pakistan's largest automobile website, as of 2024. It includes detailed information about various aspects of vehicles available for sale across Pakistan, including their prices, models, mileage, engine capacity, and age. This data offers a snapshot of the current automobile market in Pakistan, providing insights into vehicle valuation trends, consumer preferences, and market dynamics.
The dataset is designed for anyone interested in the Pakistani automobile market, whether they are buyers, sellers, car enthusiasts, analysts, or researchers. It provides a foundational dataset for a wide range of analytical and predictive tasks.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed to fine-tune the T3 AI Turkish LLM. It was created by Barathan Aslan, Ömer Faruk Çelik, and Batuhan Kalem for the T3 AI Hackathon. The dataset focuses on Turkish Agriculture.
Contributors: Barathan Aslan (https://www.linkedin.com/in/barathan-aslan-715897218/) Batuhan Kalem (https://www.linkedin.com/in/batuhankalem/) Ömer Faruk Çelik (https://www.linkedin.com/in/ömerfarukçelik/)
Question-answer pairs were generated using Gemini 1.5 Flash with multiple chains of prompts. Scoring and quality assessment were performed using Gemini 1.5 Pro.
Recommendation: For optimal fine-tuning results, we suggest excluding rows with a score value lower than 6.
The dataset provided can be used for:
- Fine-tuning the T3 AI Turkish LLM.
- Natural language processing (NLP) tasks focused on the Turkish language.

The datasets are scored based on the quality and relevance of the content, with higher scores indicating better quality.
Additionally, it should be noted that:
- -1 represents the "Safety" category.
- -2 indicates rows that were "Not Scored."
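The recommended filtering step can be sketched as below. The column names `question`, `answer` and `score` and the sample rows are assumptions for illustration; only the threshold of 6 and the meaning of -1 and -2 come from the description above.

```python
import csv
import io

# Made-up rows; the real column names may differ.
SAMPLE = '''question,answer,score
q1,a1,8
q2,a2,5
q3,a3,-1
q4,a4,-2
q5,a5,6
'''

SAFETY = -1       # rows in the "Safety" category
NOT_SCORED = -2   # rows that were not scored


def split_by_score(text, min_score=6):
    """Group rows into kept, low-quality, safety and not-scored buckets."""
    buckets = {"kept": [], "low": [], "safety": [], "not_scored": []}
    for row in csv.DictReader(io.StringIO(text)):
        score = int(row["score"])
        if score == SAFETY:
            buckets["safety"].append(row)
        elif score == NOT_SCORED:
            buckets["not_scored"].append(row)
        elif score >= min_score:
            buckets["kept"].append(row)
        else:
            buckets["low"].append(row)
    return buckets


buckets = split_by_score(SAMPLE)
```

A plain `score >= 6` filter would also drop the -1 and -2 rows, but keeping them in separate buckets makes it possible to review the "Safety" rows on their own instead of discarding them silently.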
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed to fine-tune the T3 AI Turkish LLM. It was created by Barathan Aslan, Ömer Faruk Çelik, and Batuhan Kalem for the T3 AI Hackathon. The dataset focuses on the Turkish Education System.
Contributors: Barathan Aslan (https://www.linkedin.com/in/barathan-aslan-715897218/) Batuhan Kalem (https://www.linkedin.com/in/batuhankalem/) Ömer Faruk Çelik (https://www.linkedin.com/in/ömerfarukçelik/)
Question-answer pairs were generated using Gemini 1.5 Flash with multiple chains of prompts. Scoring and quality assessment were performed using Gemini 1.5 Pro.
Recommendation: For optimal fine-tuning results, we suggest excluding rows with a score value lower than 6.
The dataset provided can be used for:
- Fine-tuning the T3 AI Turkish LLM.
- Natural language processing (NLP) tasks focused on the Turkish language.

The datasets are scored based on the quality and relevance of the content, with higher scores indicating better quality.
Additionally, it should be noted that:
- -1 represents the "Safety" category.
- -2 indicates rows that were "Not Scored."