35 datasets found

Average daily time spent on social media worldwide 2012-2024
statista.com
ai-chatbox.pro
Updated Apr 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Average daily time spent on social media worldwide 2012-2024 [Dataset]. https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/
Explore at:
Dataset updated
Apr 10, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
Worldwide
Description
How much time do people spend on social media? As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes. Global social media usageCurrently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events friends. Global impact of social mediaSocial media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased a polarization in politics and heightened everyday distractions.
Data from: San Francisco Open Data
kaggle.com
zip
Updated Mar 20, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DataSF (2019). San Francisco Open Data [Dataset]. https://www.kaggle.com/datasf/san-francisco
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset authored and provided by
DataSF
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
San Francisco
Description
Context

DataSF seeks to transform the way that the City of San Francisco works -- through the use of data.

https://datasf.org/about/

Content

This dataset contains the following tables: ['311_service_requests', 'bikeshare_stations', 'bikeshare_status', 'bikeshare_trips', 'film_locations', 'sffd_service_calls', 'sfpd_incidents', 'street_trees']

This data includes all San Francisco 311 service requests from July 2008 to the present, and is updated daily. 311 is a non-emergency number that provides access to non-emergency municipal services.

This data includes fire unit responses to calls from April 2000 to present and is updated daily. Data contains the call number, incident number, address, unit identifier, call type, and disposition. Relevant time intervals are also included. Because this dataset is based on responses, and most calls involved multiple fire units, there are multiple records for each call number. Addresses are associated with a block number, intersection or call box.

This data includes incidents from the San Francisco Police Department (SFPD) Crime Incident Reporting system, from January 2003 until the present (2 weeks ago from current date). The dataset is updated daily. Please note: the SFPD has implemented a new system for tracking crime. This dataset is still sourced from the old system, which is in the process of being retired (a multi-year process).

This data includes a list of San Francisco Department of Public Works maintained street trees including: planting date, species, and location. Data includes 1955 to present.

This dataset is deprecated and not being updated.

Fork this kernel to get started with this dataset.

Acknowledgements

http://datasf.org/

https://cloud.google.com/bigquery/public-data/sfo-311

https://cloud.google.com/bigquery/public-data/sffd-service-calls

https://cloud.google.com/bigquery/public-data/sfpd-reports

https://cloud.google.com/bigquery/public-data/sfo-trees

Dataset Source: SF OpenData. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://sfgov.org/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @meric from Unplash.

Inspiration

Which neighborhoods have the highest proportion of offensive graffiti?

Which complaint is most likely to be made using Twitter and in which neighborhood?

What are the most complained about Muni stops in San Francisco?

What are the top 10 incident types that the San Francisco Fire Department responds to?

How many medical incidents and structure fires are there in each neighborhood?

What’s the average response time for each type of dispatched vehicle?

Which category of police incidents have historically been the most common in San Francisco?

What were the most common police incidents in the category of LARCENY/THEFT in 2016?

Which non-criminal incidents saw the biggest reporting change from 2015 to 2016?

What is the average tree diameter?

What is the highest number of a particular species of tree planted in a single year?

Which San Francisco locations feature the largest number of trees?
COVID19 - The New York Times
kaggle.com
zip
Updated May 18, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). COVID19 - The New York Times [Dataset]. https://www.kaggle.com/datasets/bigquery/covid19-nyt
Explore at:
zip(0 bytes)Available download formats
Dataset updated
May 18, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
Description
Context

This is the US Coronavirus data repository from The New York Times . This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies. More information on the data repository is available here . For additional reporting and data visualizations, see The New York Times’ U.S. coronavirus interactive site

Sample Queries

Query 1

Which US counties have the most confirmed cases per capita? This query determines which counties have the most cases per 100,000 residents. Note that this may differ from similar queries of other datasets because of differences in reporting lag, methodologies, or other dataset differences.

SELECT covid19.county, covid19.state_name, total_pop AS county_population, confirmed_cases, ROUND(confirmed_cases/total_pop *100000,2) AS confirmed_cases_per_100000, deaths, ROUND(deaths/total_pop *100000,2) AS deaths_per_100000 FROM bigquery-public-data.covid19_nyt.us_counties covid19 JOIN bigquery-public-data.census_bureau_acs.county_2017_5yr acs ON covid19.county_fips_code = acs.geo_id WHERE date = DATE_SUB(CURRENT_DATE(),INTERVAL 1 day) AND covid19.county_fips_code != "00000" ORDER BY confirmed_cases_per_100000 desc

Query 2

How do I calculate the number of new COVID-19 cases per day? This query determines the total number of new cases in each state for each day available in the dataset SELECT b.state_name, b.date, MAX(b.confirmed_cases - a.confirmed_cases) AS daily_confirmed_cases FROM (SELECT state_name AS state, state_fips_code , confirmed_cases, DATE_ADD(date, INTERVAL 1 day) AS date_shift FROM bigquery-public-data.covid19_nyt.us_states WHERE confirmed_cases + deaths > 0) a JOIN bigquery-public-data.covid19_nyt.us_states b ON a.state_fips_code = b.state_fips_code AND a.date_shift = b.date GROUP BY b.state_name, date ORDER BY date desc
NOAA GSOD
kaggle.com
zip
Updated Aug 30, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NOAA (2019). NOAA GSOD [Dataset]. https://www.kaggle.com/datasets/noaa/gsod
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Aug 30, 2019
Dataset provided by
National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
Authors
NOAA
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries.

Content

Over 9000 stations' data are typically available.

The daily elements included in the dataset (as available from each station) are: Mean temperature (.1 Fahrenheit) Mean dew point (.1 Fahrenheit) Mean sea level pressure (.1 mb) Mean station pressure (.1 mb) Mean visibility (.1 miles) Mean wind speed (.1 knots) Maximum sustained wind speed (.1 knots) Maximum wind gust (.1 knots) Maximum temperature (.1 Fahrenheit) Minimum temperature (.1 Fahrenheit) Precipitation amount (.01 inches) Snow depth (.1 inches)

Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and present, collected from over 9000 stations. Dataset Source: NOAA

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Photo by Allan Nygren on Unsplash

FiveThirtyEight Daily Show Guests Dataset

kaggle.com

zip

Updated Jan 13, 2019

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

FiveThirtyEight (2019). FiveThirtyEight Daily Show Guests Dataset [Dataset]. https://www.kaggle.com/fivethirtyeight/fivethirtyeight-daily-show-guests-dataset

Explore at:

zip(37571 bytes)Available download formats

Dataset updated

Jan 13, 2019

Dataset authored and provided by

FiveThirtyEighthttps://abcnews.go.com/538

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Content

Daily Show Guests

This folder contains data behind the story Every Guest Jon Stewart Ever Had On ‘The Daily Show’.

Header	Definition
`YEAR`	The year the episode aired
`GoogleKnowlege_Occupation`	Their occupation or office, according to Google's Knowledge Graph or, if they're not in there, how Stewart introduced them on the program.
`Show`	Air date of episode. Not unique, as some shows had more than one guest
`Group`	A larger group designation for the occupation. For instance, us senators, us presidents, and former presidents are all under "politicians"
`Raw_Guest_List`	The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

Source: Google Knowlege Graph, The Daily Show clip library, Wikipedia.

Context

This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!

Update Frequency: This dataset is updated daily.

Acknowledgements

This dataset is maintained using GitHub's API and Kaggle's API.

This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.

Cover photo by Oscar Nord on Unsplash
Unsplash Images are distributed under a unique Unsplash License.

Google Stock Price Data
kaggle.com
Updated Feb 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aiden Flynn (2025). Google Stock Price Data [Dataset]. https://www.kaggle.com/datasets/flynn28/google-stock-price-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 20, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Aiden Flynn
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset contains data from 8/18/2004 to 2/18/2025 for GOOGL. The csv has the following header: Date,Open,High,Low,Close,Volume * Date: the date of the information * Open: price at market open * High: high price of the day * Low: lowest price of the day * Close: price at market close * Volume: volume of stock
Covid19_ChineseSocialMedia_Hotspots
kaggle.com
Updated Apr 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hirsch (2020). Covid19_ChineseSocialMedia_Hotspots [Dataset]. https://www.kaggle.com/hirschsun/covid19-chinesesocialmedia-hotspots/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 21, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Hirsch
Description
Context

From the beginning of 2020 to April 8th (the day Wuhan reopened), this dataset summarizes the social media hotspots and what people focused in the mainland of China, as well as the epidemic development trend during this period. The dataset containing four .csv files covers most social media platforms in the mainland: Sina Weibo, TikTok, Toutiao and Douban.

Sina Weibo

a platform based on fostering user relationships to share, disseminate and receive information. Through either the website or the mobile app, users can upload pictures and videos publicly for instant sharing, with other users being able to comment with text, pictures and videos, or use a multimedia instant messaging service. The company initially invited a large number of celebrities to join the platform at the beginning, and has since invited many media personalities, government departments, businesses and non-governmental organizations to open accounts as well for the purpose of publishing and communicating information. To avoid the impersonation of celebrities, Sina Weibo uses verification symbols; celebrity accounts have an orange letter "V" and organizations' accounts have a blue letter "V". Sina Weibo has more than 500 million registered users;[12] out of these, 313 million are monthly active users, 85% use the Weibo mobile app, 70% are college-aged, 50.10% are male and 49.90% are female. There are over 100 million messages posted by users each day. With 90 million followers, actress Xie Na holds the record for the most followers on the platform. Despite fierce competition among Chinese social media platforms, Sina Weibo has proven to be the most popular; part of this success may be attributable to the wider use of mobile technologies in China.[https://en.wikipedia.org/wiki/Sina_Weibo]

Douyin

Douyin (English: TikTok), referred to as TikTok, is a short-video social application on mobile phones. Users can record 15-second short videos, which can easily complete mouth-to-mouth (to mouth), and built-in special effects The user can leave a message to the video. Since September 2016, Toutiao has been launched online and is positioned as a short music video community suitable for Chinese young people. The application is vertical music UGC short videos, and the number of users has grown rapidly since 2017. In June 2018, Douyin reached 500 million monthly active users worldwide and 150 million daily active users in China. [https://zh.wikipedia.org/wiki/%E6%8A%96%E9%9F%B3]

Toutiao

Toutiao or Jinri Toutiao is a Chinese news and information content platform, a core product of the Beijing-based company ByteDance. By analyzing the features of content, users and users’ interaction with content, the company's algorithm models generate a tailored feed list of content for each user. Toutiao is one of China's largest mobile platforms of content creation, aggregation and distribution underpinned by machine learning techniques, with 120 million daily active users as of September 2017. [https://en.wikipedia.org/wiki/Toutiao]

Douban

Douban.com (Chinese: 豆瓣; pinyin: Dòubàn), launched on March 6, 2005, is a Chinese social networking service website that allows registered users to record information and create content related to film, books, music, recent events, and activities in Chinese cities. It could be seen as one of the most influential web 2.0 websites in China. Douban also owns an internet radio station, which ranks No.1 in the iOS App Store in 2012. Douban was formerly open to both registered and unregistered users. For registered users, the site recommends potentially interesting books, movies, and music to them in addition to serving as a social network website such as WeChat, Weibo and record keeper; for unregistered users, the site is a place to find ratings and reviews of media. Douban has about 200 million registered users as of 2013. The site serves pan-Chinese users, and its contents are in Chinese. It covers works and media in Chinese and in foreign languages. Some Chinese authors and critics register their official personal pages on the site. [https://en.wikipedia.org/wiki/Douban]

Content

Weibo realTimeHotSearchList can be regarded as a platform for gathering celebrity gossip, social life and major news. In this document, I collect the top 50 topics of the hot search list every 12 hours during the day, so there are 100 hot topics each day. These topics are converted into English by Google translation, although the translation effect is not ideal due to sentence segmentation and language background deviation. In this document, I created a new column ['Coron-Related ( 1 yes, 0 not ) '] to mark topics related to the new crown, if relevant, it is marked as 1, if not then marked empty or 0. The google translation is extremely inaccurate (so maybe google the Chinese title to confirm is the best bet...
Z
INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET
data.niaid.nih.gov
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
Nafiz Sadman
Kishor Datta Gupta
Nishat Anjum
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bangladesh, United States
Description
Introduction

There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [ 1 ] using Standford NLP and SVM by Rameshbhaiet. Al.compared several algorithms on a small and large dataset. Rubinet. al., in their paper [ 2 ], created a mechanism to differentiate fake news from real ones by building a set of characteristics of news according to their types. The purpose was to contribute to the low resource data available for training machine learning algorithms. Doumitet. al.in [ 3 ] have implemented LDA, a topic modeling approach to study bias present in online news media.

However, there are not many NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT-scans to detect presence of pneumonia in lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus[ 5 ][ 6 ][ 7 ] and replicating its structure to fight and find a vaccine. This research is crucial in battling the pandemic. The few NLP based research publications are sentiment classification of online tweets by Samuel et el [ 8 ] to understand fear persisting in people due to the virus. Similar work has been done using the LSTM network to classify sentiments from online discussion forums by Jelodaret. al.[ 9 ]. NKK dataset is the first study on a comparatively larger dataset of a newspaper report on COVID-19, which contributed to the virus’s awareness to the best of our knowledge.

2 Data-set Introduction

2.1 Data Collection

We accumulated 1000 online newspaper report from United States of America (USA) on COVID-19. The newspaper includes The Washington Post (USA) and StarTribune (USA). We have named it as “Covid-News-USA-NNK”. We also accumulated 50 online newspaper report from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspaper includes The Daily Star (BD) and Prothom Alo (BD). All these newspapers are from the top provider and top read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was suitable compared to automation to ensure the news were highly relevant to the subject. The newspaper online sites had dynamic content with advertisements in no particular order. Therefore there were high chances of online scrappers to collect inaccurate news reports. One of the challenges while collecting the data is the requirement of subscription. Each newspaper required $1 per subscriptions. Some criteria in collecting the news reports provided as guideline to the human data-collectors were as follows:

The headline must have one or more words directly or indirectly related to COVID-19.

The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

Avoid taking duplicate reports.

Maintain a time frame for the above mentioned newspapers.

To collect these data we used a google form for USA and BD. We have two human editor to go through each entry to check any spam or troll entry.

2.2 Data Pre-processing and Statistics

Some pre-processing steps performed on the newspaper report dataset are as follows:

Remove hyperlinks.

Remove non-English alphanumeric characters.

Remove stop words.

Lemmatize text.

While more pre-processing could have been applied, we tried to keep the data as much unchanged as possible since changing sentence structures could result us in valuable information loss. While this was done with help of a script, we also assigned same human collectors to cross check for any presence of the above mentioned criteria.

The primary data statistics of the two dataset are shown in Table 1 and 2.

Table 1: Covid-News-USA-NNK data statistics

No of words per headline

7 to 20

No of words per body content

150 to 2100

Table 2: Covid-News-BD-NNK data statistics No of words per headline

10 to 20

No of words per body content

100 to 1500

2.3 Dataset Repository

We used GitHub as our primary data repository in account name NKK^1. Here, we created two repositories USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON format. We are regularly updating the CSV files and regenerating JSON using a py script. We provided a python script file for essential operation. We welcome all outside collaboration to enrich the dataset.

3 Literature Review

Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

Some well-known applications of NLP includes fraud detection on online media sites[ 10 ], using authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ] and machine translations used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in the algorithm solely for Natural Language Processing tasks. The two most trending ones are BERT[ 14 ], which uses bidirectional encoder-decoder architecture to create the transformer model, that can do near-perfect classification tasks and next-word predictions for next generations, and GPT-3 models released by OpenAI[ 15 ] that can generate texts almost human-like. However, these are all pre-trained models since they carry huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of text. One commonly used topic modeling is Latent Dirichlet Allocation or LDA[17].

Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.

Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

4 Our experiments and Result analysis

We used the wordcloud library^4 to create the word clouds. Figure 1 and 3 presents the word cloud of Covid-News-USA- NNK dataset by month from February to May. From the figures 1,2,3, we can point few information:

In February, both the news paper have talked about China and source of the outbreak.

StarTribune emphasized on Minnesota as the most concerned state. In April, it seemed to have been concerned more.

Both the newspaper talked about the virus impacting the economy, i.e, bank, elections, administrations, markets.

Washington Post discussed global issues more than StarTribune.

StarTribune in February mentioned the first precautionary measurement: wearing masks, and the uncontrollable spread of the virus throughout the nation.

While both the newspaper mentioned the outbreak in China in February, the weight of the spread in the United States are more highlighted through out March till May, displaying the critical impact caused by the virus.

We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’ , ’Infections’, ’Quarantined’, Lock-down’, ’Diagnosed’ etc from the news reports and created a number of cases for both the newspaper. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for the covid cases as it gradually rose from February. Both the newspaper clearly shows us that the rise in covid cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used Vader Sentiment Analysis to extract sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. Vader Sentiment scale ranges from -1(highly negative to 1(highly positive). There were some cases

where the sentiment scores of the headline and body contradicted each other,i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us information about how a state or country is reacting to the pandemic. We used PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both the datasets are: ’China’, Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’ , ’Stock market’ , ’Jobs’ , ’Election’, ’Missteps’, ’Health’, ’Response’. Keywords extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
f
Association between Stock Market Gains and Losses and Google Searches
figshare.com
datadryad.org
doc
Updated Jun 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eli Arditi; Eldad Yechiam; Gal Zahavi (2023). Association between Stock Market Gains and Losses and Google Searches [Dataset]. http://doi.org/10.1371/journal.pone.0141354
Explore at:
docAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0141354
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS ONE
Authors
Eli Arditi; Eldad Yechiam; Gal Zahavi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Experimental studies in the area of Psychology and Behavioral Economics have suggested that people change their search pattern in response to positive and negative events. Using Internet search data provided by Google, we investigated the relationship between stock-specific events and related Google searches. We studied daily data from 13 stocks from the Dow-Jones and NASDAQ100 indices, over a period of 4 trading years. Focusing on periods in which stocks were extensively searched (Intensive Search Periods), we found a correlation between the magnitude of stock returns at the beginning of the period and the volume, peak, and duration of search generated during the period. This relation between magnitudes of stock returns and subsequent searches was considerably magnified in periods following negative stock returns. Yet, we did not find that intensive search periods following losses were associated with more Google searches than periods following gains. Thus, rather than increasing search, losses improved the fit between people’s search behavior and the extent of real-world events triggering the search. The findings demonstrate the robustness of the attentional effect of losses.
n
Daily United States COVID-19 Testing and Outcomes Data By State, March 7,...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Jul 28, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The COVID Tracking Project at The Atlantic (2021). Daily United States COVID-19 Testing and Outcomes Data By State, March 7, 2020 to March 7, 2021 [Dataset]. http://doi.org/10.5061/dryad.9kd51c5hk
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.9kd51c5hk
Dataset updated
Jul 28, 2021
Dataset provided by
.
Authors
The COVID Tracking Project at The Atlantic
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Area covered
United States
Description
The COVID Tracking Project was a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States. Our dataset was in use by national and local news organizations across the United States and by research projects and agencies worldwide.

Every day, we collected data on COVID-19 testing and patient outcomes from all 50 states, 5 territories, and the District of Columbia by visiting official public health websites for those jurisdictions and entering reported values in a spreadsheet. The files in this dataset represent the entirety of our COVID-19 testing and outcomes data collection from March 7, 2020 to March 7, 2021. This dataset includes official values reported by each state on each day of antigen, antibody, and PCR test result totals; the total number of probable and confirmed cases of COVID-19; the number of people currently hospitalized, in intensive care, and on a ventilator; the total number of confirmed and probable COVID-19 deaths; and more.

Methods This dataset was compiled by about 300 volunteers with The COVID Tracking Project from official sources of state-level COVID-19 data such as websites and press conferences. Every day, a team of about a dozen available volunteers visited these official sources and recorded the publicly reported values in a shared Google Sheet, which was used as a data source to publish the full dataset each day between about 5:30pm and 7pm Eastern time. All our data came from state and territory public health authorities or official statements from state officials. We did not automatically scrape data or attempt to offer a live feed. Our data was gathered and double-checked by humans, and we emphasized accuracy and context over speed. Some data was corrected or backfilled from structured data provided by public health authorities. Additional information about our methods can be found in a series of posts at http://covidtracking.com/analysis-updates.

We offer thanks and heartfelt gratitude for the labor and sacrifice of our volunteers. Volunteers on the Data Entry, Data Quality, and Data Infrastructure teams who granted us permission to use their name publicly are listed in VOLUNTEERS.md.
Search Engines in Germany - Market Research Report (2015-2030)
ibisworld.com
Updated Jun 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IBISWorld (2024). Search Engines in Germany - Market Research Report (2015-2030) [Dataset]. https://www.ibisworld.com/germany/industry/search-engines/935/
Explore at:
Dataset updated
Jun 19, 2024
Dataset authored and provided by
IBISWorld
License
https://www.ibisworld.com/about/termsofuse/https://www.ibisworld.com/about/termsofuse/
Time period covered
2014 - 2029
Area covered
Germany
Description
In the last five years, the web portal industry has recorded significant revenue growth. Industry revenue increased by an average of 3.8% per year between 2019 and 2024 and is expected to reach 12.6 billion euros in the current year. The web portal industry comprises a variety of platforms such as social networks, search engines, video platforms and email services that are used by millions of users every day. These portals enable the exchange of information and communication as well as entertainment. Web portals generate their revenue mainly through advertising, premium services and commission payments. User numbers are rising steadily as more and more people go online and everyday processes are increasingly digitalised.In 2024, industry revenue is expected to increase by 3.2 %. Although the industry is growing, it is also facing challenges, particularly in terms of data protection. Web portals are constantly collecting user data, which can lead to misuse of the collected data. The General Data Protection Regulation (GDPR) introduced in the European Union in 2018 has prompted web portal operators to review their data protection practices and amend their terms and conditions in order to avoid fines. The aim of this regulation is to improve the protection of personal data and prevent data misuse.The industry's turnover is expected to increase by an average of 3.6% per year to 15 billion euros over the next five years. Video platforms such as YouTube often generate losses despite high user numbers. The reasons for this are the high costs of operation and infrastructure as well as expenses for copyright issues and compliance. Advertising on video platforms is perceived negatively by users, but is successful when it comes to attracting attention. Politicians are debating the taxation of revenues generated by internationally operating web portals based in tax havens. Another challenge is the copying of concepts, which inhibits innovation in the industry and can lead to legal problems.
g
Usage metrics of the TousAntiCovid application | gimi9.com
gimi9.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Usage metrics of the TousAntiCovid application | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_5fa93b994b29f6390f150980_1
Explore at:
Description
The TousAntiCovid app TousAntiCovid is an application that allows everyone to be an actor in the fight against the epidemic. This is an additional barrier gesture that is activated at all times when you have to redouble your vigilance: at the restaurant, in the canteen, when you go to a gym, when you participate in a professional event, when there is a risk that not everyone will respect the other barrier gestures. TousAntiCovid complements the action of doctors and sickness insurance, aimed at containing the spread of the virus by stopping the chains of contamination as soon as possible. The principle is as follows: prevent, while guaranteeing anonymity, people who have been close to a person tested positive, so that they can get tested and taken care of as soon as possible. It also makes it possible to stay informed about the evolution of the epidemic and the conduct to be held and thus to remain vigilant and adopt the right actions. It allows easy access to other tools available to citizens wishing to be involved in the fight against the epidemic: DepistageCovid which gives map of nearby labs and wait times and MesConseilsCovid which provides personalised advice to protect and protect others. The installation of the TousAntiCovid app is done on a voluntary basis. Everyone is supported even if they choose not to use the app. The app is downloaded from the Apple Store and Google Play: Hello.tousanticovid.gouv.fr/ ### Description of the data This dataset informs for each day since the launch of the application on 2 June 2020: — Cumulative total of the number of registered applications minus the number of deregistrations. — Cumulative total of users notified by the application: the number of users notified by the application as risk contacts following exposure to COVID-19, since 2 June 2020. — Cumulative total of users reporting as COVID-19 cases per day: the number of users who reported as COVID-19 cases in the application, since 2 June 2020.
Data (i.e., evidence) about evidence based medicine
figshare.com
search.datacite.org
png
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jorge H Ramirez (2023). Data (i.e., evidence) about evidence based medicine [Dataset]. http://doi.org/10.6084/m9.figshare.1093997.v24
Explore at:
pngAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1093997.v24
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Jorge H Ramirez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Update — December 7, 2014. – Evidence-based medicine (EBM) is not working for many reasons, for example: 1. Incorrect in their foundations (paradox): hierarchical levels of evidence are supported by opinions (i.e., lowest strength of evidence according to EBM) instead of real data collected from different types of study designs (i.e., evidence). http://dx.doi.org/10.6084/m9.figshare.1122534 2. The effect of criminal practices by pharmaceutical companies is only possible because of the complicity of others: healthcare systems, professional associations, governmental and academic institutions. Pharmaceutical companies also corrupt at the personal level, politicians and political parties are on their payroll, medical professionals seduced by different types of gifts in exchange of prescriptions (i.e., bribery) which very likely results in patients not receiving the proper treatment for their disease, many times there is no such thing: healthy persons not needing pharmacological treatments of any kind are constantly misdiagnosed and treated with unnecessary drugs. Some medical professionals are converted in K.O.L. which is only a puppet appearing on stage to spread lies to their peers, a person supposedly trained to improve the well-being of others, now deceits on behalf of pharmaceutical companies. Probably the saddest thing is that many honest doctors are being misled by these lies created by the rules of pharmaceutical marketing instead of scientific, medical, and ethical principles. Interpretation of EBM in this context was not anticipated by their creators. “The main reason we take so many drugs is that drug companies don’t sell drugs, they sell lies about drugs.” ―Peter C. Gøtzsche “doctors and their organisations should recognise that it is unethical to receive money that has been earned in part through crimes that have harmed those people whose interests doctors are expected to take care of. Many crimes would be impossible to carry out if doctors weren’t willing to participate in them.” —Peter C Gøtzsche, The BMJ, 2012, Big pharma often commits corporate crime, and this must be stopped. Pending (Colombia): Health Promoter Entities (In Spanish: EPS ―Empresas Promotoras de Salud).

Misinterpretations New technologies or concepts are difficult to understand in the beginning, it doesn’t matter their simplicity, we need to get used to new tools aimed to improve our professional practice. Probably the best explanation is here in these videos (credits to Antonio Villafaina for sharing these videos with me). English https://www.youtube.com/watch?v=pQHX-SjgQvQ&w=420&h=315 Spanish https://www.youtube.com/watch?v=DApozQBrlhU&w=420&h=315 ----------------------- Hypothesis: hierarchical levels of evidence based medicine are wrong Dear Editor, I have data to support the hypothesis described in the title of this letter. Before rejecting the null hypothesis I would like to ask the following open question:Could you support with data that hierarchical levels of evidence based medicine are correct? (1,2) Additional explanation to this question: – Only respond to this question attaching publicly available raw data.– Be aware that more than a question this is a challenge: I have data (i.e., evidence) which is contrary to classic (i.e., McMaster) or current (i.e., Oxford) hierarchical levels of evidence based medicine. An important part of this data (but not all) is publicly available. References

Ramirez, Jorge H (2014): The EBM challenge. figshare. http://dx.doi.org/10.6084/m9.figshare.1135873

The EBM Challenge Day 1: No Answers. Competing interests: I endorse the principles of open data in human biomedical research Read this letter on The BMJ – August 13, 2014.http://www.bmj.com/content/348/bmj.g3725/rr/762595Re: Greenhalgh T, et al. Evidence based medicine: a movement in crisis? BMJ 2014; 348: g3725. _ Fileset contents Raw data: Excel archive: Raw data, interactive figures, and PubMed search terms. Google Spreadsheet is also available (URL below the article description). Figure 1. Unadjusted (Fig 1A) and adjusted (Fig 1B) PubMed publication trends (01/01/1992 to 30/06/2014). Figure 2. Adjusted PubMed publication trends (07/01/2008 to 29/06/2014) Figure 3. Google search trends: Jan 2004 to Jun 2014 / 1-week periods. Figure 4. PubMed publication trends (1962-2013) systematic reviews and meta-analysis, clinical trials, and observational studies.
Figure 5. Ramirez, Jorge H (2014): Infographics: Unpublished US phase 3 clinical trials (2002-2014) completed before Jan 2011 = 50.8%. figshare.http://dx.doi.org/10.6084/m9.figshare.1121675 Raw data: "13377 studies found for: Completed | Interventional Studies | Phase 3 | received from 01/01/2002 to 01/01/2014 | Worldwide". This database complies with the terms and conditions of ClinicalTrials.gov: http://clinicaltrials.gov/ct2/about-site/terms-conditions Supplementary Figures (S1-S6). PubMed publication delay in the indexation processes does not explain the descending trends in the scientific output of evidence-based medicine. Acknowledgments I would like to acknowledge the following persons for providing valuable concepts in data visualization and infographics:

Maria Fernanda Ramírez. Professor of graphic design. Universidad del Valle. Cali, Colombia.

Lorena Franco. Graphic design student. Universidad del Valle. Cali, Colombia. Related articles by this author (Jorge H. Ramírez)

Ramirez JH. Lack of transparency in clinical trials: a call for action. Colomb Med (Cali) 2013;44(4):243-6. URL: http://www.ncbi.nlm.nih.gov/pubmed/24892242

Ramirez JH. Re: Evidence based medicine is broken (17 June 2014). http://www.bmj.com/node/759181

Ramirez JH. Re: Global rules for global health: why we need an independent, impartial WHO (19 June 2014). http://www.bmj.com/node/759151

Ramirez JH. PubMed publication trends (1992 to 2014): evidence based medicine and clinical practice guidelines (04 July 2014). http://www.bmj.com/content/348/bmj.g3725/rr/759895 Recommended articles

Greenhalgh Trisha, Howick Jeremy,Maskrey Neal. Evidence based medicine: a movement in crisis? BMJ 2014;348:g3725

Spence Des. Evidence based medicine is broken BMJ 2014; 348:g22

Schünemann Holger J, Oxman Andrew D,Brozek Jan, Glasziou Paul, JaeschkeRoman, Vist Gunn E et al. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies BMJ 2008; 336:1106

Lau Joseph, Ioannidis John P A, TerrinNorma, Schmid Christopher H, OlkinIngram. The case of the misleading funnel plot BMJ 2006; 333:597

Moynihan R, Henry D, Moons KGM (2014) Using Evidence to Combat Overdiagnosis and Overtreatment: Evaluating Treatments, Tests, and Disease Definitions in the Time of Too Much. PLoS Med 11(7): e1001655. doi:10.1371/journal.pmed.1001655

Katz D. A-holistic view of evidence based medicinehttp://thehealthcareblog.com/blog/2014/05/02/a-holistic-view-of-evidence-based-medicine/ ---
Cyclistic Ride Trips Dataset
kaggle.com
Updated May 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ishansh Maurya (2025). Cyclistic Ride Trips Dataset [Dataset]. https://www.kaggle.com/datasets/vishanshm/cyclistic-ride-trips-dataset/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ishansh Maurya
Description
The original dataset has been transformed as contains the following columns:

ride_id: Unique ID for each ride rideable_type: Type of bike (classic_bike, electric_bike) member_casual: Rider type (member, casual) ride_length: Duration of the ride in HH:MM:SS format weekday: Day of the week (1 = Monday, …, 7 = Sunday; 6–7 = weekend) hour: Hour of the day (0 to 23) month: Month of the ride (1 = April 2024, …, 12 = March 2025)
Z
Data from: EyeFi: Fast Human Identification Through Vision and WiFi-based...
data.niaid.nih.gov
Updated Dec 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shiwei Fang (2022). EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3882103
Explore at:
Dataset updated
Dec 4, 2022
Dataset provided by
Shiwei Fang
Tamzeed Islam
Sirajum Munir
Shahriar Nirjon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
EyeFi Dataset

This dataset is collected as a part of the EyeFi project at Bosch Research and Technology Center, Pittsburgh, PA, USA. The dataset contains WiFi CSI values of human motion trajectories along with ground truth location information captured through a camera. This dataset is used in the following paper "EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching" that is published in the IEEE International Conference on Distributed Computing in Sensor Systems 2020 (DCOSS '20). We also published a dataset paper titled as "Dataset: Person Tracking and Identification using Cameras and Wi-Fi Channel State Information (CSI) from Smartphones" in Data: Acquisition to Analysis 2020 (DATA '20) workshop describing details of data collection. Please check it out for more information on the dataset.

Clarification/Bug report: Please note that the order of antennas and subcarriers in .h5 files is not written clearly in the README.md file. The order of antennas and subcarriers are as follows for the 90 csi_real and csi_imag values : [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3,… subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. Please see the description below. The newer version of the dataset contains this information in README.md. We are sorry for the inconvenience.

Data Collection Setup

In our experiments, we used Intel 5300 WiFi Network Interface Card (NIC) installed in an Intel NUC and Linux CSI tools [1] to extract the WiFi CSI packets. The (x,y) coordinates of the subjects are collected from Bosch Flexidome IP Panoramic 7000 panoramic camera mounted on the ceiling and Angle of Arrivals (AoAs) are derived from the (x,y) coordinates. Both the WiFi card and camera are located at the same origin coordinates but at different height, the camera is location around 2.85m from the ground and WiFi antennas are around 1.12m above the ground.

The data collection environment consists of two areas, first one is a rectangular space measured 11.8m x 8.74m, and the second space is an irregularly shaped kitchen area with maximum distances of 19.74m and 14.24m between two walls. The kitchen also has numerous obstacles and different materials that pose different RF reflection characteristics including strong reflectors such as metal refrigerators and dishwashers.

To collect the WiFi data, we used a Google Pixel 2 XL smartphone as an access point and connect the Intel 5300 NIC to it for WiFi communication. The transmission rate is about 20-25 packets per second. The same WiFi card and phone are used in both lab and kitchen area.

List of Files Here is a list of files included in the dataset:

|- 1_person |- 1_person_1.h5 |- 1_person_2.h5 |- 2_people |- 2_people_1.h5 |- 2_people_2.h5 |- 2_people_3.h5 |- 3_people |- 3_people_1.h5 |- 3_people_2.h5 |- 3_people_3.h5 |- 5_people |- 5_people_1.h5 |- 5_people_2.h5 |- 5_people_3.h5 |- 5_people_4.h5 |- 10_people |- 10_people_1.h5 |- 10_people_2.h5 |- 10_people_3.h5 |- Kitchen |- 1_person |- kitchen_1_person_1.h5 |- kitchen_1_person_2.h5 |- kitchen_1_person_3.h5 |- 3_people |- kitchen_3_people_1.h5 |- training |- shuffuled_train.h5 |- shuffuled_valid.h5 |- shuffuled_test.h5 View-Dataset-Example.ipynb README.md

In this dataset, folder 1_person/ , 2_people/ , 3_people/ , 5_people/, and 10_people/ contains data collected from the lab area whereas Kitchen/ folder contains data collected from the kitchen area. To see how the each file is structured, please see below in section Access the data.

The training folder contains the training dataset we used to train the neural network discussed in our paper. They are generated by shuffling all the data from 1_person/ folder collected in the lab area (1_person_1.h5 and 1_person_2.h5).

Why multiple files in one folder?

Each folder contains multiple files. For example, 1_person folder has two files: 1_person_1.h5 and 1_person_2.h5. Files in the same folder always have the same number of human subjects present simultaneously in the scene. However, the person who is holding the phone can be different. Also, the data could be collected through different days and/or the data collection system needs to be rebooted due to stability issue. As result, we provided different files (like 1_person_1.h5, 1_person_2.h5) to distinguish different person who is holding the phone and possible system reboot that introduces different phase offsets (see below) in the system.

Special note:

For 1_person_1.h5, this file is generated by the same person who is holding the phone, and 1_person_2.h5 contains different people holding the phone but only one person is present in the area at a time. Boths files are collected in different days as well.

Access the data To access the data, hdf5 library is needed to open the dataset. There are free HDF5 viewer available on the official website: https://www.hdfgroup.org/downloads/hdfview/. We also provide an example Python code View-Dataset-Example.ipynb to demonstrate how to access the data.

Each file is structured as (except the files under "training/" folder):

|- csi_imag |- csi_real |- nPaths_1 |- offset_00 |- spotfi_aoa |- offset_11 |- spotfi_aoa |- offset_12 |- spotfi_aoa |- offset_21 |- spotfi_aoa |- offset_22 |- spotfi_aoa |- nPaths_2 |- offset_00 |- spotfi_aoa |- offset_11 |- spotfi_aoa |- offset_12 |- spotfi_aoa |- offset_21 |- spotfi_aoa |- offset_22 |- spotfi_aoa |- nPaths_3 |- offset_00 |- spotfi_aoa |- offset_11 |- spotfi_aoa |- offset_12 |- spotfi_aoa |- offset_21 |- spotfi_aoa |- offset_22 |- spotfi_aoa |- nPaths_4 |- offset_00 |- spotfi_aoa |- offset_11 |- spotfi_aoa |- offset_12 |- spotfi_aoa |- offset_21 |- spotfi_aoa |- offset_22 |- spotfi_aoa |- num_obj |- obj_0 |- cam_aoa |- coordinates |- obj_1 |- cam_aoa |- coordinates ... |- timestamp

The csi_real and csi_imag are the real and imagenary part of the CSI measurements. The order of antennas and subcarriers are as follows for the 90 csi_real and csi_imag values : [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3,… subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. nPaths_x group are SpotFi [2] calculated WiFi Angle of Arrival (AoA) with x number of multiple paths specified during calculation. Under the nPath_x group are offset_xx subgroup where xx stands for the offset combination used to correct the phase offset during the SpotFi calculation. We measured the offsets as:

Antennas Offset 1 (rad) Offset 2 (rad)
1 & 2 1.1899 -2.0071
1 & 3 1.3883 -1.8129

The measurement is based on the work [3], where the authors state there are two possible offsets between two antennas which we measured by booting the device multiple times. The combination of the offset are used for the offset_xx naming. For example, offset_12 is offset 1 between antenna 1 & 2 and offset 2 between antenna 1 & 3 are used in the SpotFi calculation.

The num_obj field is used to store the number of human subjects present in the scene. The obj_0 is always the subject who is holding the phone. In each file, there are num_obj of obj_x. For each obj_x1, we have the coordinates reported from the camera and cam_aoa, which is estimated AoA from the camera reported coordinates. The (x,y) coordinates and AoA listed here are chronologically ordered (except the files in the training folder) . It reflects the way the person carried the phone moved in the space (for obj_0) and everyone else walked (for other obj_y, where y > 0).

The timestamp is provided here for time reference for each WiFi packets.

To access the data (Python):

import h5py

data = h5py.File('3_people_3.h5','r')

csi_real = data['csi_real'][()] csi_imag = data['csi_imag'][()]

cam_aoa = data['obj_0/cam_aoa'][()] cam_loc = data['obj_0/coordinates'][()]

For file inside training/ folder:

Files inside training folder has a different data structure:

|- nPath-1 |- aoa |- csi_imag |- csi_real |- spotfi |- nPath-2 |- aoa |- csi_imag |- csi_real |- spotfi |- nPath-3 |- aoa |- csi_imag |- csi_real |- spotfi |- nPath-4 |- aoa |- csi_imag |- csi_real |- spotfi

The group nPath-x is the number of multiple path specified during the SpotFi calculation. aoa is the camera generated angle of arrival (AoA) (can be considered as ground truth), csi_image and csi_real is the imaginary and real component of the CSI value. spotfi is the SpotFi calculated AoA values. The SpotFi values are chosen based on the lowest median and mean error from across 1_person_1.h5 and 1_person_2.h5. All the rows under the same nPath-x group are aligned (i.e., first row of aoa corresponds to the first row of csi_imag, csi_real, and spotfi. There is no timestamp recorded and the sequence of the data is not chronological as they are randomly shuffled from the 1_person_1.h5 and 1_person_2.h5 files.

Citation If you use the dataset, please cite our paper:

@inproceedings{eyefi2020, title={EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching}, author={Fang, Shiwei and Islam, Tamzeed and Munir, Sirajum and Nirjon, Shahriar}, booktitle={2020 IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS)},
Day & night temperatures, 50yrs, 1666ws, TFRecord
kaggle.com
zip
Updated Nov 9, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Görner (2019). Day & night temperatures, 50yrs, 1666ws, TFRecord [Dataset]. https://www.kaggle.com/datasets/mgorner/day-night-temperatures-50yrs-1666ws-tfrecord
Explore at:
zip(160157825 bytes)Available download formats
Dataset updated
Nov 9, 2019
Authors
Martin Görner
License
https://www.usa.gov/government-works/https://www.usa.gov/government-works/
Description
This dataset is a cleaned-up extract from the following public BigQuery dataset: https://console.cloud.google.com/marketplace/details/noaa-public/ghcn-d

The dataset contains daily min/max temperatures from a selection of 1666 weather stations. The data spans exactly 50 years. Missing values have been interpolated and are marked as such.

This dataset is in TFRecord format.

About the original dataset: NOAA’s Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. The data are obtained from more than 20 sources. The GHCN-Daily is an integrated database of daily climate summaries from land surface stations across the globe, and is comprised of daily climate records from over 100,000 stations in 180 countries and territories, and includes some data from every year since 1763.
g
Demographics
health.google.com
Updated Oct 7, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). Demographics [Dataset]. https://health.google.com/covid-19/open-data/raw-data
Explore at:
Dataset updated
Oct 7, 2021
Variables measured
key, population, population_male, rural_population, urban_population, population_female, population_density, clustered_population, population_age_00_09, population_age_10_19, and 11 more
Description
Various population statistics, including structured demographics data.
Bellabeat - Case Study (Google Career Certificate)
kaggle.com
Updated Feb 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexandra Loop (2024). Bellabeat - Case Study (Google Career Certificate) [Dataset]. https://www.kaggle.com/datasets/alexandraloop/bellabeat-case-study-google-career-certificate/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 21, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Alexandra Loop
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Analyst: Alexandra Loop Date: 12/02/2024

Business Task:

Question to be Answered : - What are trends in non-Bellabeat smart device usage? - What do these trends suggest for Bellabeat customers? - How could these trends help influence Bellabeat marketing strategy?

Description of Data Sources:

Data Set to be studied: FitBit Fitness Tracker Data: Pattern Recognition with tracker data: Improve Your Overall Health

Data privacy: Data was sourced from a public dataset available on Kaggle. Information has been anonymized prior to being posted online.

Bias: Due to the degree of anonymity in this study, the only demographic data available in this study is weight, and other cultural differences or lifestyle requirements cannot be accounted for. The sample size is quite small. The time period of the study is only a month so the observer effect could conceivably still be influencing the sample groups. We also have no information on the weather in the region studied. April and May are very variable months in terms of accessible outdoor activities.

Process:

Cleaning Process: After going through the data to find duplicates, whitespace, and nulls, I have determined that this set of data has been well-cleaned and already aggregated into several reasonably sized spreadsheets.

Trim: No issues found

Consistent length ID: No issues found

Irrelevant columns: In WLI_M the fat column is not consistently filled in so it is not productive to use it in analysis Sedentary_active_distance was mostly filled with nulls and could confuse the data I have removed the columns

Irrelevant Rows: 77 rows in daily_Activity_merged had 0s across the board. As there is little chance that someone would take zero steps I decided to interpret these days as ones where people did not put on the fitbit. As such they are irrelevant rows. Removed 77 columns. 85 rows in daily_intensities_merged registered 0 minutes of sedentary activity, which I do not believe to be possible. Row 241 logged 2 minutes of sedentary activity. I have determined it to be unusable. Row 322 likewise does not add up to a day’s minutes and has been deleted. Removed 85 columns 7 rows had 1440 sedentary minutes, which I have determined to be time on but not used. Implication of the presence noted.

Scientifically debunked information: BMI as a measurement has been determined to be problematic on many lines, it misrepresents non-white people who have different healthy body types, does not account for muscle mass or scoliosis, has been known to change definitions in accordance with business interests rather than health data, and was never meant to be used as a measure of individual health. I have removed the BMI column from the Weight Log Info chart.

Cleaning Process 1: I have elected to see what can be found in the data as it was organized by the providers first.
Cleaning Process 2: I calculated and removed rows where the participants did not put on the fitbit. These rows were removed, and the implications of their presence have been noted. Found Averages, Minimum, and Maximum Values of Steps, distance, types of active minutes, and calories. Found the sum of all kinds of minutes documented to check for inconsistencies. Found the difference between total minutes and a full 1440 minutes. I tried to make a pie chart to convey the average minutes of activity, and so created a duplicate dataset to trim down and remove misleading data caused by different inputs.

Analysis:

Observations: On average, the participants do not seem interested in moderate physical activity as it was the category with the fewest number of active minutes. Perhaps advertise the effectiveness of low impact workouts. Very few participants volunteered their weights, but none of them lost weight. The person with the highest weight volunteered it only once near the beginning. Given evidence from the Health At Every Size movement, we cannot deny the possibility that having to be weight conscious could have had negative effects on this individual. I would suggest that weight would be a counterproductive focus for our marketing campaign as it would make heavier people less likely to want to participate, and any claims of weight loss would be statistically unfounded, and open us up to false advertising lawsuits. Fully half of the participants had days where they did not put on their fitbit at all during the day. For a total number of 77-84 lost days of data, meaning that on average participants who did not wear their fitbit daily lost 5 days of data, though of course some lost significantly more. I would suggest focusing on creating a biometric tracker that is comfortable and rarely needs to be charged so that people will gain more reliable resources from it. 400 full days of data are recorded, meaning that the participants did not take the device off to sleep, shower, or swim. 280 more have 16...
Data from: Menstrual Cycle Data
kaggle.com
Updated Aug 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikita Bisht (2021). Menstrual Cycle Data [Dataset]. https://www.kaggle.com/nikitabisht/menstrual-cycle-data/tasks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 17, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Nikita Bisht
Description
Context

Periods usually arrive every month in a woman’s life. But we all are so busy with mundane work that we tend to forget our period dates. Moreover, most of the women have such an inconsistent cycle that it is worthless for them to remember their previous dates. This dataset can be used to create apps and websites to predict the menstrual days more efficiently.

Content

The dataset has 80 columns and I downloaded this from https://epublications.marquette.edu/data_nfp/7/ . I cleaned this dataset a lot and then made a webapp to predict the menstrual cycle and ovulation days of a woman. Though the dataset that I have uploaded is original and not the one that I cleaned.

Acknowledgements

The whole credit for this dataset goes to those people who made it and uploaded it on this website https://epublications.marquette.edu/data_nfp/7/ . I have just downloaded the dataset and uploaded it on Kaggle.

Inspiration

Before this, I was unable to find a dataset on the menstrual cycle. I searched so many websites like Kaggle, Google datasets, arxiv and went through a lot of research papers but the dataset that I was looking for was nowhere to be found. Then I stumbled on this publication which made it possible for me to build an app that enables women to get to know more about their bodies. So I am uploading it here so that all of you don't have to go through thousands of datasets that might not be of great use to you.
Stock Market
kaggle.com
Updated Sep 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nguyễn Ngô Minh Đức (2022). Stock Market [Dataset]. https://www.kaggle.com/nguynngminhc/stock-market/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 27, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Nguyễn Ngô Minh Đức
Description
This is a dataset of stocks of the four giants Apple, Amazon, Microsoft, and Google. Some suggestions for you to start with is to analyze the closing price and trading volume Daily stock changes v.v Gojo and Getou DA

Antennas	Offset 1 (rad)	Offset 2 (rad)
1 & 2	1.1899	-2.0071
1 & 3	1.3883	-1.8129

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista (2024). Average daily time spent on social media worldwide 2012-2024 [Dataset]. https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/

Average daily time spent on social media worldwide 2012-2024

Explore at:

Dataset updated

Apr 10, 2024

Dataset authored and provided by

Statistahttp://statista.com/

Area covered

Worldwide

Description

How much time do people spend on social media? As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes. Global social media usageCurrently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events friends. Global impact of social mediaSocial media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased a polarization in politics and heightened everyday distractions.

Clear search

Close search

Google apps

Main menu

Average daily time spent on social media worldwide 2012-2024

Data from: San Francisco Open Data

Context

Content

Acknowledgements

Inspiration

COVID19 - The New York Times

Context

Sample Queries

Query 1

Query 2

NOAA GSOD

Overview

Content

Querying BigQuery tables

Acknowledgements

FiveThirtyEight Daily Show Guests Dataset

Content

Daily Show Guests

Context

Acknowledgements

Google Stock Price Data

Covid19_ChineseSocialMedia_Hotspots

Context

Sina Weibo

Douyin

Toutiao

Douban

Content

INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

Association between Stock Market Gains and Losses and Google Searches

Daily United States COVID-19 Testing and Outcomes Data By State, March 7,...

Search Engines in Germany - Market Research Report (2015-2030)

Usage metrics of the TousAntiCovid application | gimi9.com

Data (i.e., evidence) about evidence based medicine

Cyclistic Ride Trips Dataset

Data from: EyeFi: Fast Human Identification Through Vision and WiFi-based...

Day & night temperatures, 50yrs, 1666ws, TFRecord

Demographics

Bellabeat - Case Study (Google Career Certificate)

Data from: Menstrual Cycle Data

Context

Content

Acknowledgements

Inspiration

Stock Market

Average daily time spent on social media worldwide 2012-2024See More Versions

Average daily time spent on social media worldwide 2012-2024