Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information related to the COVID-19 pandemic, specifically the first case and first death for every single country. It focuses on two major fields: First Case, which records the date of the first case(s), the number of confirmed cases on the first day, the age of the first patient(s), and the last country visited; and First Death, which records the date of the first death and the age of the first patient who died, for every country along with its corresponding continent. The datasets also contain a binary matrix of the spread chain among different countries and regions.
*This entry is not a country but a cruise ship; the name of the cruise ship was not released by the government.
"N+": the age is not specified but greater than N
“No Trace”: some data was not found
“Unspecified”: not available from the authority
“N/A”: for “Last Visited Country(s) of Confirmed Case(s)” column, “N/A” indicates that the confirmed case(s) of those countries do not have any travel history in recent past; in “Age of First Death(s)” column “N/A” indicates that those countries do not have may death case till May 16, 2020.
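When loading the files, these sentinel strings need explicit handling; below is a minimal pandas sketch (the file name and column name are hypothetical, chosen only to illustrate the conventions above):

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv(
    "first_case_first_death.csv",
    na_values=["No Trace", "Unspecified", "N/A"],  # map the sentinels to NaN
    keep_default_na=False,
)

def age_lower_bound(value):
    """Parse ages, treating 'N+' (e.g. '65+') as a lower bound of N."""
    if pd.isna(value):
        return None
    s = str(value).strip()
    return int(s[:-1]) if s.endswith("+") else int(s)

df["age_first_case_min"] = df["Age of First Case"].map(age_lower_bound)
```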
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
These files are videos generated by a stochastic simulation created by Nikki Steenbakkers under the supervision of Marko Boon and Bert Zwart (all affiliated with Eindhoven University of Technology) for her bachelor final project "Simulating the Spread of COVID-19 in the Netherlands". The report can be found in the TU/e repository of bachelor project reports: https://research.tue.nl/en/studentTheses/simulating-the-spread-of-covid-19-in-the-netherlands. The report contains more information about the project and the simulation, and explicitly refers to these files.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
League of Legends is a popular global online game played by millions of players monthly. In the past few years, the League of Legends e-sports industry has shown phenomenal growth; in 2020, the World Championship finals drew 3.8 million peak viewers! While the e-sports industry still lags behind traditional sports in terms of popularity and viewership, it has shown exponential growth in certain regions with fast-growing economies, such as Vietnam and China, making it a prime target for sponsorship by foreign companies looking to spread brand awareness in those regions.
While the e-sports data industry is also showing gradual growth, not much is publicly available in terms of published analysis of individual games. This may be because the games change fast compared to traditional sports: rules and game stats are frequently and arbitrarily changed by the developers. Nevertheless, it is an interesting field for fun research, hence the many pet projects and graduate-level papers dedicated to it.
All existing League of Legends games (minus custom games, including ones from competitions) are made available by Riot's API. However, having to request and parse the data for every single relevant game is quite tedious; this dataset intends to save you that work. To make things (hopefully) easier, I parsed all JSON files returned by the Riot API into CSV files, with each row corresponding to one game.
This dataset consists of three parts: root games, root2tail, and tail games.
I found that, when trying to predict the outcome of a match before it is played, a player's historical matches prior to that game often count as an important factor (Hall, 2017). For that purpose, root games contains 1087 games from which tail games branches out.
Tail games contains the historical matches of each player for every game in root games. Root2tail maps each player's account ID in root games, together with the champion ID that player controlled, to a list of matches that can be found in tail games.
To put it simply, to access the historical matches of a player in the root games file: 1. Get the player's account ID and the game ID. 2. Load the root2tail file. 3. Query for the matching row on account ID and game ID. 4. The corresponding row contains a list of game IDs that can be queried in the tail games files. A sketch of these steps follows.
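A minimal pandas sketch of those four steps (the file names, column names, and the list format in root2tail are assumptions based on the description above, not verified against the actual CSVs):

```python
import ast
import pandas as pd

# Assumed file names for the three parts of the dataset.
root = pd.read_csv("root_games.csv")
root2tail = pd.read_csv("root2tail.csv")
tail = pd.read_csv("tail_games.csv")

# Steps 1-2: take a player's account ID and game ID from a root game,
# then load root2tail (already loaded above).
game_id, account_id = 4711, "encrypted-account-id"  # placeholder values

# Step 3: query for the matching row on account ID and game ID.
row = root2tail[(root2tail["gameId"] == game_id)
                & (root2tail["accountId"] == account_id)].iloc[0]

# Step 4: that row holds a list of game IDs to look up in tail games;
# if the list is stored as a string, parse it first.
history_ids = ast.literal_eval(row["matchIds"])  # assumed column name
history = tail[tail["gameId"].isin(history_ids)]
```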
Note that root2tail documents the most recent 5 matches, or a list of matches played within the past 5 weeks, prior to the game creation date of the corresponding "root game". It also only documents the most recent games the player played with the same champion he/she played in the "root game". An empty list means the player has not played a single match with that champion within the past 5 weeks.
On December 5th, 2020, I fetched the list of current players in the Challenger tier, then recursively gathered the historical matches of those players to build root games, so that is the data collection date.
Root2tail is self-explanatory. As for the other files, each row represents a single game. The columns are quite confusing, however, as each row is a flattened version of a JSON file with nested lists of dictionaries.
I tried to think of the simplest way to make the columns comprehensible, but looking at the original JSON file is most likely the easiest way to understand the structure. Use a tool like https://jsonformatter.curiousconcept.com/ to inspect the dummy_league_match.json file.
A very simple explanation: the participant.stats.* and participant.timeline.* columns contain pretty much all match-related statistics of a player during the game.
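To see how the nesting maps to the flattened columns, pandas can reproduce a similar flattening on the dummy file (the exact separator used for the published CSVs is an assumption):

```python
import json
import pandas as pd

with open("dummy_league_match.json") as f:
    match = json.load(f)

# Nested dicts become dotted column names, e.g. participant.stats.kills.
flat = pd.json_normalize(match, sep=".")
print(flat.columns.tolist()[:20])  # peek at the first few column names
```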
Also, note that the "accountId" fields use encrypted account IDs which are specific to my API key. If you want to do additional research using player account IDs, you should fetch the match file first and get your own list of player account IDs.
The following are great resources I got a lot of help from: 1. https://riot-watcher.readthedocs.io/en/latest/ 2. https://riot-api-libraries.readthedocs.io/en/latest/
These two actually explain everything you need to get started on your own project with Riot API.
The following are links to related projects that could maybe help you get ideas!
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides up-to-date global COVID-19 statistics from the year 2025. It includes essential attributes such as country name, continent, date, population, active cases, total cases, recovered cases, cases per 1 million population (1M_pop), and more. The data offers a comprehensive view of the current pandemic situation worldwide.
Inspiration & Use Cases:
- Suitable for beginners exploring the field of data science and looking to practice real-world data analysis.
- Ideal for performing Exploratory Data Analysis (EDA) to identify trends, patterns, and anomalies in the COVID-19 spread across different regions (see the sketch after this list).
- Can be used to build predictive models, e.g., forecasting the future number of cases based on recent growth trends.
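As a starting point, a minimal EDA sketch (the file name and exact column labels such as "Total Cases" are assumptions):

```python
import pandas as pd

df = pd.read_csv("covid19_global_2025.csv")  # assumed file name

# Basic structure and data-quality checks.
print(df.info())
print(df.isna().sum())

# Example: the ten countries with the highest total case counts.
top10 = (df.groupby("Country")["Total Cases"]  # assumed column names
           .max()
           .sort_values(ascending=False)
           .head(10))
print(top10)
```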
On 1 April 2025 responsibility for fire and rescue transferred from the Home Office to the Ministry of Housing, Communities and Local Government.
This information covers fires, false alarms and other incidents attended by fire crews, and the statistics include the numbers of incidents, fires, fatalities and casualties as well as information on response times to fires. The Ministry of Housing, Communities and Local Government (MHCLG) also collect information on the workforce, fire prevention work, health and safety and firefighter pensions. All data tables on fire statistics are below.
MHCLG has responsibility for fire services in England. The vast majority of data tables produced by the Ministry of Housing, Communities and Local Government are for England, but some tables (0101, 0103, 0201, 0501, 1401) are for Great Britain, split by nation. In the past the Department for Communities and Local Government (who previously had responsibility for fire services in England) produced data tables for Great Britain and at times the UK. Similar information for the devolved administrations is available at Scotland: Fire and Rescue Statistics (https://www.firescotland.gov.uk/about/statistics/), Wales: Community safety (https://statswales.gov.wales/Catalogue/Community-Safety-and-Social-Inclusion/Community-Safety) and Northern Ireland: Fire and Rescue Statistics (https://www.nifrs.org/home/about-us/publications/).
If you use assistive technology (for example, a screen reader) and need a version of any of these documents in a more accessible format, please email alternativeformats@communities.gov.uk. Please tell us what format you need. It will help us if you say what assistive technology you use.
Fire statistics guidance
Fire statistics incident level datasets
FIRE0101: Incidents attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 143 KB): https://assets.publishing.service.gov.uk/media/68f0f810e8e4040c38a3cf96/FIRE0101.xlsx (see also: Previous FIRE0101 tables)
FIRE0102: Incidents attended by fire and rescue services in England, by incident type and fire and rescue authority (MS Excel Spreadsheet, 2.12 MB): https://assets.publishing.service.gov.uk/media/68f0ffd528f6872f1663ef77/FIRE0102.xlsx (see also: Previous FIRE0102 tables)
FIRE0103: Fires attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 197 KB): https://assets.publishing.service.gov.uk/media/68f20a3e06e6515f7914c71c/FIRE0103.xlsx (see also: Previous FIRE0103 tables)
FIRE0104: Fire false alarms by reason for false alarm, England (MS Excel Spreadsheet, 443 KB): https://assets.publishing.service.gov.uk/media/68f20a552f0fc56403a3cfef/FIRE0104.xlsx (see also: Previous FIRE0104 tables)
FIRE0201: Dwelling fires attended by fire and rescue services by motive, population and nation (MS Excel Spreadsheet, 192 KB): https://assets.publishing.service.gov.uk/media/68f100492f0fc56403a3cf94/FIRE0201.xlsx (see also: Previous FIRE0201 tables)
<span class="gem
A statistical "almanac" for London. Data mostly comes from third-party sources, especially the ONS.
All data has been uploaded to Google Docs (though spread across many spreadsheets, so the download URL links to a listing page rather than the raw data).
The license isn't clear (ONS data is probably covered by the Click-Use licence, but other data comes from the UN etc.).
Background: Ehrlichia canis, a rickettsial organism, is responsible for causing ehrlichiosis, a tick-borne disease affecting dogs.
Objectives: This study aimed to estimate ehrlichiosis prevalence and identify associated risk factors in pet dogs.
Methods: A total of 246 peripheral blood samples were purposively collected from pet dogs in Dhaka, Mymensingh, and Rajshahi districts between December 2018 and December 2020. Risk factor data were obtained through face-to-face interviews with dog owners using a pre-structured questionnaire. Multivariable logistic regression analysis identified risk factors. Polymerase chain reaction targeting the 16S rRNA gene confirmed Ehrlichia spp. PCR results were further validated by sequencing.
Results: The prevalence and case fatality of ehrlichiosis were 6.9% and 47.1%, respectively. Dogs in rural areas had 5.8 times higher odds of ehrlichiosis (odds ratio, OR: 5.84; 95% CI: 1.72–19.89) compared to urban areas. Dogs with access to other dogs had 5.14 times higher odds of ehrlichiosis (OR: 5.14; 95% CI: 1.63–16.27) than those without such access. Similarly, dogs irregularly treated with ectoparasitic drugs had 4.01 times higher odds of ehrlichiosis (OR: 4.01; 95% CI: 1.17–14.14) compared to regularly treated dogs. The presence of ticks on dogs increased the odds of ehrlichiosis nearly 3-fold (OR: 3.02; 95% CI: 1.02–8.97). Phylogenetic analysis, based on 17 commercially sequenced isolates, showed different clusters of aggregation; BAUMAH-13 (PP321265) settled with a Chinese isolate (OK667945), BAUMAH-05 (PP321257) with a Greek isolate (MN922610), BAUMAH-16 (PP321268) with an Italian isolate (KX180945), and BAUMAH-07 (PP321259) with a Thai isolate (OP164610).
Conclusions: Pet owners and veterinarians in rural areas should be vigilant in monitoring dogs for ticks and ensuring proper preventive care. Limiting access to other dogs in high-risk areas can help mitigate disease spread. Tick prevention measures and regular treatment with ectoparasitic drugs will reduce the risk of ehrlichiosis in dogs. The observed genetic similarity of the Bangladeshi Ehrlichia canis strain highlights the need for ongoing surveillance and research to develop effective control and prevention strategies, both within Bangladesh and globally.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1 Introduction
There are several works based on Natural Language Processing of newspaper reports. Mining opinions from headlines [1] using Stanford NLP and SVM by Rameshbhai et al. compared several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. in [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, there is not much NLP research invested in studying COVID-19. Most applications involve classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine; this research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection "Covid-News-USA-NNK". We also accumulated 50 online newspaper reports from Bangladesh on the issue and named them "Covid-News-BD-NNK". The newspapers include The Daily Star (BD) and Prothom Alo (BD). All of these newspapers are among the top providers and most-read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferable to automation in order to ensure the news was highly relevant to the subject: the newspapers' online sites have dynamic content with advertisements in no particular order, so there was a high chance of an automated scraper collecting inaccurate news reports. One challenge while collecting the data was the requirement of a subscription; each newspaper charged $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data collectors, were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above-mentioned newspapers.
To collect the data we used a Google Form for the USA and BD. We had two human editors go through each entry to check for any spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows (a sketch of these steps appears after the list):
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
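For illustration, the steps above can be sketched with NLTK as follows (this is not the authors' actual script; it assumes the stopwords and wordnet corpora are already downloaded):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # remove non-English alphanumerics
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)           # lemmatize

print(preprocess("COVID-19 cases are rising; see https://example.com for details"))
```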
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since altering sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above-mentioned criteria.
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100

Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
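The CSV-to-JSON regeneration can be done in a couple of lines; a sketch with assumed file names (not the repository's actual script):

```python
import pandas as pd

# Regenerate the JSON release from the maintained CSV (file names assumed).
df = pd.read_csv("Covid-News-USA-NNK.csv")
df.to_json("Covid-News-USA-NNK.json", orient="records", indent=2)
```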
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing. The two most trending ones are BERT [14], which uses a bidirectional encoder architecture based on the transformer model and achieves near-perfect performance on classification tasks and masked-word prediction, and the GPT-3 models released by OpenAI [15], which can generate text that is almost human-like. However, these are all pre-trained models, since they carry a huge computation cost. Information extraction is the general concept of retrieving information from a dataset: from an image it could mean retrieving vital feature spaces or targeted portions of the image; from speech it could mean retrieving information about names, places, etc. [16]; in texts it could mean identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction: it clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].
Keyword extraction is an information-extraction process and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the most weight.
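A compact TextRank-style sketch: build a co-occurrence graph over the words and rank them with PageRank (the window size and the plain-split tokenization are illustrative simplifications, not the exact setup of [18]):

```python
import networkx as nx

def textrank_keywords(words, window=4, top_k=5):
    # Connect words that co-occur within a sliding window.
    graph = nx.Graph()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            graph.add_edge(words[i], words[j])
    scores = nx.pagerank(graph)  # the "weight" of each word
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = "covid cases rise as economy slows and covid response lags".split()
print(textrank_keywords(tokens))
```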
Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.
4 Our experiments and Result analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 to 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can point out a few observations:
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the most concerned state; in April, its concern seemed to grow.
Both newspapers talked about the virus impacting the economy, e.g., banks, elections, administrations, and markets.
Washington Post discussed global issues more than StarTribune.
StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread within the United States is more highlighted throughout March to May, displaying the critical impact caused by the virus.
We used a script to extract all numbers related to certain keywords like 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', 'Diagnosed', etc. from the news reports and compiled a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction, we can observe that April was the peak month for COVID cases, with a gradual rise from February. Both newspapers clearly show that the rise in cases from February to March was slower than the rise from March to April, an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows a positive response against the attack. We used VADER sentiment analysis to extract the sentiment of the headlines and bodies. On average, the sentiments ranged from -0.5 to -0.9 on the VADER scale, which runs from -1 (highly negative) to +1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative while the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract keywords from the headlines as well as the body content; PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
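For reference, the compound score used above comes from the vaderSentiment package; a minimal example:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# compound ranges from -1 (highly negative) to +1 (highly positive).
headline = "Virus deaths surge as markets tumble"
body = "Officials reported early signs of recovery in several regions."
print(analyzer.polarity_scores(headline)["compound"])  # negative
print(analyzer.polarity_scores(body)["compound"])      # slightly positive
```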
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
Analyzing the spread of information related to a specific event in the news has many potential applications. Consequently, various systems have been developed to facilitate the analysis of information spreading, such as the detection of disease propagation and the identification of fake-news spreading through social media. There are several open challenges in the process of discerning information propagation, among them the lack of resources for training and evaluation. This paper describes the process of compiling a corpus from the EventRegistry global media monitoring system. We focus on information spreading in three domains: sports (i.e., the FIFA World Cup), natural disasters (i.e., earthquakes), and climate change (i.e., global warming). This corpus is a valuable addition to the currently available datasets for examining the spreading of information about various kinds of events.
Introduction:
Domain-specific gaps in information spreading are ubiquitous and may exist due to economic conditions, political factors, or linguistic, geographical, time-zone, cultural, and other barriers. These factors potentially contribute to obstructing the flow of local as well as international news. We believe that there is a lack of research studies that examine, identify, and uncover the reasons for barriers in information spreading. Additionally, there is limited availability of datasets containing news text and metadata including time, place, source, and other relevant information. When a piece of information starts spreading, it implicitly raises questions such as: How far does the information, in the form of news, reach out to the public? Does the content of the news remain the same or change to a certain extent? Do cultural values impact the information, especially when the same news gets translated into other languages?
Statistics about datasets:
--------------------------------------------------------------------------------------------------------------------------------------
# Domain Event Type Articles Per Language Total Articles
1 Sports FIFA World Cup 983-en, 762-sp, 711-de, 10-sl, 216-pt 2679
2 Natural Disaster Earthquake 941-en, 999-sp, 937-de, 19-sl, 251-pt 3194
3 Climate Changes Global Warming 996-en, 298-sp, 545-de, 8-sl, 97-pt 1945
--------------------------------------------------------------------------------------------------------------------------------------
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset is based on tables with detailed data for municipalities and boroughs from the population census and the occupational census of the Netherlands, 1947. These detailed tables from the archive of Statistics Netherlands have never been published. They were written on so-called ‘transparanten’, sheets in A4 format. The set contains more than 35 table types, some of which spread over two or more sheets, and some combined on one sheet. Image scans of the detailed tables were made in JPEG format in February 2005. Those scans, 29489 in total, were published on www.volkstellingen.nl, ordered by province and municipality. In a later stage the scans were converted by data entry to Excel worksheets. In most cases one scan was converted to one Excel file; however, if a scan contains two or more tables, a separate Excel file was made for each table. The Excel files have also been converted to CSV text files. The thematic collection ‘12th Population Census 31 May 1947’ contains 11 datasets for the provinces plus one dataset for the Netherlands as a whole. The documentation for any dataset in the collection contains a description of the contents of all table types and the instructions given for data entry. This dataset regards the files of the province of Groningen; the files are grouped by municipality. General files for the province of Groningen are contained in the dataset for the Netherlands as a whole. The metadata per file (details) contains the table number. An overview of table numbers by file is contained in ‘Table number per scan_Groningen.csv’; this applies to the scans as well as the Excel files and the CSV text files. The file ‘Titles of Tables’ shows the table numbers with the corresponding titles of the tables; it is available as a PDF document and as a CSV text file. 12th Population Census 31 May 1947 - Groningen.
Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter stream related to COVID-19 chatter. The first 9 weeks of data (from January 1st, 2020 to March 11th, 2020) contain very low tweet counts, as we were filtering other data we were collecting for other research purposes; however, one can see the dramatic increase as awareness of the virus spread. Dedicated data gathering ran from March 11th to March 22nd, yielding over 4 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (40,823,816 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (7,479,940 unique tweets). There are several practical reasons for us to leave the retweets in; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter.
As always, the tweets distributed here are only tweet identifiers (with date and time added) due to Twitter's terms and conditions on re-distributing Twitter data. They need to be hydrated before use.
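One common way to hydrate the IDs is the twarc package (shown as an illustration; any Twitter API client will do, and the TSV layout is assumed to carry the tweet ID in the first column):

```python
import csv
from twarc import Twarc

# Requires your own Twitter API credentials.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("full_dataset-clean.tsv") as f:
    reader = csv.reader(f, delimiter="\t")
    ids = (row[0] for row in reader if row and row[0].isdigit())  # skip any header
    for tweet in t.hydrate(ids):
        print(tweet["id_str"], tweet["full_text"][:80])
```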
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides economic indicators used to monitor Iowa's economy and forecast future direction of economic activity in Iowa.
List of the data tables as part of the Immigration system statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.
If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.
The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.
Immigration system statistics, year ending September 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives
Passenger arrivals summary tables, year ending September 2025 (ODS, 31.5 KB): https://assets.publishing.service.gov.uk/media/691afc82e39a085bda43edd8/passenger-arrivals-summary-sep-2025-tables.ods
‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.
Electronic travel authorisation detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 58.6 KB): https://assets.publishing.service.gov.uk/media/691b03595a253e2c40d705b9/electronic-travel-authorisation-datasets-sep-2025.xlsx
ETA_D01: Applications for electronic travel authorisations, by nationality
ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality
Entry clearance visas summary tables, year ending September 2025 (ODS, 53.3 KB): https://assets.publishing.service.gov.uk/media/6924812a367485ea116a56bd/visas-summary-sep-2025-tables.ods
Entry clearance visa applications and outcomes detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 30.2 MB): https://assets.publishing.service.gov.uk/media/691aebbf5a253e2c40d70598/entry-clearance-visa-outcomes-datasets-sep-2025.xlsx
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome
Additional data relating to in country and overseas...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Zoonotic Disease Dataset
Zoonotic diseases are infections that spread between people and animals. This dataset contains information to investigate the correlation between climate variables (temperature, precipitation) and zoonotic disease in different countries across different years. The data is clean and has no missing values.
Dataset variables:
Country: region from where the data was collected
Year: year when the data was collected
Temperature: collected in degrees Celsius
Precipitation: collected in millimeters (mm)
Zoonotic Cases: number of zoonotic infections
Population Density: number of people per square kilometer of the country
Urbanization Rate: percentage of the country's population living in urban areas
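A minimal sketch of the intended climate-versus-cases analysis (the file name is assumed; column names follow the variable list above):

```python
import pandas as pd

df = pd.read_csv("zoonotic_disease.csv")  # assumed file name

# Pairwise correlation of each variable with the zoonotic case counts.
cols = ["Temperature", "Precipitation", "Population Density",
        "Urbanization Rate", "Zoonotic Cases"]
print(df[cols].corr()["Zoonotic Cases"].sort_values(ascending=False))
```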
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains statistical data on the growth and spread of internet usage rates in the Kingdom of Saudi Arabia.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
By US Open Data Portal, data.gov [source]
This Electronic Health Information Legal Epidemiology dataset offers an extensive collection of legal and epidemiological data that can be used to understand the complexities of electronic health information. It contains a detailed balance of variables, including legal requirements, enforcement mechanisms, proprietary tools, access restrictions, privacy and security implications, data rights and responsibilities, and user accounts and authentication systems. This set provides researchers with real-world insights into how EHI law functions, in order to assess its impact on patient safety and public health outcomes. With such data it is possible to gain a better understanding of current policies regulating electronic health information, as well as their potential for improvement in safeguarding patient confidentiality. Use this dataset to explore how these laws impact our healthcare system by examining patterns across different groups over time, or analyze changes leading up to new versions or updates.
1. Start by familiarizing yourself with the different columns of the dataset. Examine each column closely and look up any unfamiliar terminology to get a better understanding of what the columns reference.
2. Once you understand the data and what it is intended to represent, think about how you might want to use it in your analysis. You may want to create a research question, or a narrower focus for your project surrounding the legal epidemiology of electronic health information, that can be answered with this dataset.
3. After creating your research plan, begin manipulating and cleaning the data as needed to prepare it for analysis or visualization, as specified in your project plan or in the research question/model design steps you have outlined.
4. Next, perform exploratory data analysis (EDA) on relevant subsets of the data, for example on specific countries or specific target groups (e.g., by gender). Filter out irrelevant information, analyze the patterns and trends observed in your filtered datasets, and compare areas that have differing e-health rules and regulations, where decisions by elected officials are strongly driven by demographics, socioeconomic factors, ideology, etc. Look for correlations using statistical methods as needed, validate findings against multiple reference datasets when available, and keep restricted data private and visible only to duly authorized users.
5. Finally, create concrete summaries reporting your discoveries and share the findings, preferably as infographics showcasing the evidence and main conclusions, so that the broader community and interested professionals can benefit from the results.
- Studying how technology affects public health policies and practice - Using the data, researchers can look at the various types of legal regulations related to electronic health information to examine any relations between technology and public health decisions in certain areas or regions.
- Evaluating trends in legal epidemiology – With this data, policymakers can identify patterns that help measure the evolution of electronic health information regulations over time and investigate why such rules are changing within different states or countries.
- Analysing possible impacts on healthcare costs – Looking at changes in laws, regulations, and standards relate...
This dataset allows for a comparison between the offline and online spread of COVID-19.
The dataset was obtained using the following APIs: https://github.com/pomber/covid19, https://github.com/GeneralMills/pytrends, and https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
Can internet traffic data help to understand the spread of the virus?
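Of the three sources above, pytrends supplies the "online spread" side; a minimal sketch of pulling search interest for one term (the keyword and timeframe are illustrative):

```python
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["coronavirus"], timeframe="2020-01-01 2020-06-01")
interest = pytrends.interest_over_time()  # search-interest index over time
print(interest.head())
```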
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset and this description are made available at http://www-stat.stanford.edu/~tibs/ElemStatLearn/data.html.
Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
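Given that layout, each file can be read directly with NumPy; a sketch (the file name is assumed):

```python
import gzip
import numpy as np

with gzip.open("zip.train.gz", "rt") as f:
    data = np.loadtxt(f)

labels = data[:, 0].astype(int)           # digit id in the first column
images = data[:, 1:].reshape(-1, 16, 16)  # 256 values -> one 16 x 16 image
print(labels.shape, images.shape)
```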
There are 7291 training observations and 2007 test observations, distributed as follows:

          0     1    2    3    4    5    6    7    8    9  Total
Train  1194  1005  731  658  652  556  664  645  542  644   7291
Test    359   264  198  166  200  160  170  147  166  177   2007

or as proportions:

          0     1     2     3     4     5     6     7     8     9
Train  0.16  0.14  0.10  0.09  0.09  0.08  0.09  0.09  0.07  0.09
Test   0.18  0.13  0.10  0.08  0.10  0.08  0.08  0.07  0.08  0.09
Alternatively, the training data are available as separate files per digit (and hence without the digit identifier in each row)
The test set is notoriously "difficult", and a 2.5% error rate is excellent. These data were kindly made available by the neural network group at AT&T research labs (thanks to Yann Le Cun).
PS: In this dataset, the class is represented as 1-10
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical measures extracted from the simulation results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. This catalogue entry provides post-processed ERA5 hourly single-level data aggregated to daily time steps. In addition to the data selection options found on the hourly page, the following options can be selected for the daily statistic calculation:
- The daily aggregation statistic (daily mean, daily max, daily min, daily sum*)
- The sub-daily frequency sampling of the original data (1 hour, 3 hours, 6 hours)
- The option to shift to any local time zone in UTC (no shift means the statistic is computed from UTC+00:00)
*The daily sum is only available for the accumulated variables (see ERA5 documentation for more details). Users should be aware that the daily aggregation is calculated during the retrieval process and is not part of a permanently archived dataset. For more details on how the daily statistics are calculated, including demonstrative code, please see the documentation. For more details on the hourly data used to calculate the daily statistics, please refer to the ERA5 hourly single-level data catalogue entry and the documentation found therein.
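Retrieval goes through the CDS API (cdsapi). The sketch below is hedged: the dataset id and request keys are assumptions meant to mirror the options listed above, so check the catalogue entry and its documentation for the exact names:

```python
import cdsapi

c = cdsapi.Client()  # needs a CDS account and an ~/.cdsapirc file

# Dataset id and request keys are illustrative, not verified.
c.retrieve(
    "derived-era5-single-levels-daily-statistics",
    {
        "variable": "2m_temperature",
        "year": "2024",
        "month": "01",
        "day": ["01", "02", "03"],
        "daily_statistic": "daily_mean",  # or daily max / min / sum
        "frequency": "1_hourly",          # sub-daily sampling: 1, 3 or 6 hours
        "time_zone": "utc+00:00",         # optional shift to a local time zone
    },
    "era5_daily_t2m.nc",
)
```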