Fitness tracking reveals task-specific associations between memory, mental health, and physical activity
We ran an online experiment using the Amazon Mechanical Turk (MTurk) platform (ref. 31). We collected data about each participant’s fitness and physical activity habits, a variety of self-reported measures concerning their mental health, and their performance on a battery of memory tasks.
We recruited experimental participants by posting our experiment as a Human Intelligence Task (HIT) on the MTurk platform. We limited participation to MTurk Workers who had been assigned a “master worker” designation on the platform, given to workers who score highly across several metrics on a large number of HITs, according to a proprietary algorithm managed by Amazon. One criterion embedded into the algorithm is a requirement that master workers must maintain a HIT acceptance rate of at least 95%. We further limited our participant pool to participants who self-reported that they were fluent in English and regularly used a Fitbit fitness tracker device. A total of 160 workers accepted our HIT in order to participate in our experiment. Of these, we excluded all participants who failed to log into their Fitbit account (giving us access to their anonymized fitness tracking data), encountered technical issues (e.g., by accessing the HIT using an incompatible browser, device, or operating system), or who ended their participation prematurely, before completing the full study. In all, 113 participants contributed usable data to the study.
For their participation, workers received a base payment of $5 per hour (computed in 15 min increments, rounded up to the nearest 15 min), plus an additional performance-based bonus of up to $5. Our recruitment procedure and study protocol were approved by Dartmouth’s Committee for the Protection of Human Subjects. We obtained informed consent using an online form administered to all prospective participants prior to enrolling them in our study. All methods were performed in accordance with the relevant guidelines and regulations.
Gender, age, and race
Of the 113 participants who contributed usable data, 77 reported their gender as female, 35 as male, and 1 chose not to report their gender. Participants ranged in age from 19 to 68 years old (25th percentile: 28.25 years; 50th percentile: 32 years; 75th percentile: 38 years). Participants reported their race as White (90 participants), Black or African American (11 participants), Asian (7 participants), Other (4 participants), and American Indian or Alaska Native (3 participants). One participant opted not to report their race.
All participants reported that they were fluent in either 1 or 2 languages (25th percentile: 1; 50th percentile: 1; 75th percentile: 1), and that they were “familiar” with between 1 and 11 languages (25th percentile: 1; 50th percentile: 2; 75th percentile: 3).
Reported medical conditions and medications
Participants reported having and/or taking medications pertaining to the following medical conditions: anxiety or depression (4 participants), recent head injury (2 participants), high blood pressure (1 participant), bipolar disorder (1 participant), hypothyroidism (1 participant), and other unspecified conditions or medications (1 participant). Participants reported their current and typical stress levels on a Likert scale as very relaxed (−2), a little relaxed (−1), neutral (0), a little stressed (1), or very stressed (2). The “current” stress level reflected participants’ stress at the time they participated in the experiment. Their responses ranged from −2 to 2 (current stress: 25th percentile: −2; 50th percentile: −1; 75th percentile: 1; typical stress: 25th percentile: 0; 50th percentile: 1; 75th percentile: 1). Participants also reported their current level of alertness on a Likert scale as very sluggish (−2), a little sluggish (−1), neutral (0), a little alert (1), or very alert (2). Their responses ranged from −2 to 2 (25th percentile: 0; 50th percentile: 1; 75th percentile: 2). Nearly all (111 out of 113) participants reported that they had normal color vision, and 15 participants reported uncorrected visual impairments (including dyslexia and uncorrected near- or far-sightedness).
Residence and level of education
Participants reported their residence as being located in the suburbs (36 participants), a large city (30 participants), a small city (23 participants), rural (14 participants), or a small town (10 participants). Participants reported their level of education as follows: College graduate (42 participants), Master’s degree (23 participants), Some college (21 participants), High school graduate (9 participants), Associate’s degree (8 participants), Other graduate or professional school (5 participants), Some graduate training (3 participants), or Doctorate (2 participants).
Reported water and coffee intake
Participants reported the number of 8 oz cups of water and coffee they had consumed prior to accepting the HIT. Water consumption ranged from 0 to 6 cups (25th percentile: 1; 50th percentile: 3; 75th percentile: 4). Coffee consumption ranged from 0 to 4 cups (25th percentile: 0; 50th percentile: 1; 75th percentile: 2).
Upon accepting the HIT posted on MTurk, each worker was directed to read and fill out a screening and consent form, and to share access to their anonymized Fitbit data via their Fitbit account. After consenting to participate in our study and successfully sharing their Fitbit data, participants filled out a survey and then engaged in a series of memory tasks (Fig. 1). All stimuli and code for running the full MTurk experiment may be found at https://github.com/ContextLab/brainfit-task.
Figure 1. Battery of memory tasks. (a) Free recall. Participants study 16 words (presented one at a time), followed by an immediate memory test where they type each word they remember from the just-studied list. In the delayed memory test, participants type any words they remember studying, from any list. (b) Naturalistic recall. Participants watch a brief video, followed by two immediate memory tests. The first test asks participants to write out what happened in the video. The second test has participants answer a series of multiple choice questions about the conceptual content of the video. In the delayed memory test, participants (again) write out what happened in the video. (c) Foreign language flashcards. Participants study a sequence of 10 English-Gaelic word pairs, each presented with an illustration of the given word. During an immediate memory test, participants perform a multiple choice test where they select the Gaelic word that corresponds to the given photograph. During the delayed memory test, participants perform a second multiple choice test, where they select the Gaelic word that corresponds to each of a new set of photographs. (d) Spatial learning. In each trial, participants study a set of randomly positioned shapes. Next, the shapes’ positions are altered, and participants are asked to drag the shapes back to their previous positions. All panels. The gray numbers denote the order in which participants experienced each task or test.
We collected the following demographic information from each participant: their birth year, gender, highest (academic) degree achieved, race, language fluency, and language familiarity. We also collected information about participants’ health and wellness, including about their vision, alertness, stress, sleep, coffee and water consumption, location of their residence, activity typically required for their job, and physical activity habits.
Free recall (Fig. 1a)
Participants studied a sequence of four word lists, each comprising 16 words. After studying each list, participants received an immediate memory test, whereby they were asked to type (one word at a time) any words they remembered from the just-studied list, in any order.
Words were presented for 2 s each, in black text on a white background, followed by a 2 s blank (white) screen. After the final 2 s pause, participants were given 90 s to type in as many words as they could remember, in any order. The memory test was constructed such that the participant could only see the text of the current word they were typing; when they pressed any non-letter key, the current word was submitted and the text box they were typing in was cleared. This was intended to prevent participants from retroactively editing their previous responses.
The word lists participants studied were drawn from the categorized lists reported in ref. 32. Each participant was assigned four unique randomly chosen lists (in a randomized order), selected from a full set of 16 lists. Each chosen list was then randomly shuffled before presenting the words to the participants. Participants also performed a final delayed memory test where they were given 180 s to type out any words they remembered from any of the 4 lists they had studied.
Recalled words within an edit distance of 2 (i.e., a Levenshtein distance less than or equal to 2) of any word in the wordpool were “autocorrected” to their nearest match. We also manually corrected clear typos or misspellings (e.g., we corrected “hippoptumas” to “hippopotamus”, “zucinni” to “zucchini”, and so on). Finally, we lemmatized each submitted word to match the plurality of the matching wordpool word (e.g., “bongo” was corrected to “bongos”, and so on). After applying these corrections, any submitted words that matched words presented on the just-studied list were tagged as “correct” recalls, and any non-matching words were discarded as “errors.” Because participants were not allowed to edit the text they entered, we chose not to analyze these putative “errors,” since we could not distinguish typos from true misrememberings.
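The automatic correction step described above can be illustrated with a short sketch. This is not the authors' code: the function names (`levenshtein`, `autocorrect`) and the tie-breaking rule (the first wordpool entry at the smallest distance wins) are assumptions made for illustration, and the manual correction and lemmatization steps are not reproduced.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def autocorrect(response: str, wordpool: List[str],
                max_distance: int = 2) -> Optional[str]:
    """Return the nearest wordpool word within max_distance, else None.

    Assumption: ties are broken in favor of the earliest wordpool entry.
    """
    response = response.strip().lower()
    best_word, best_distance = None, max_distance + 1
    for word in wordpool:
        distance = levenshtein(response, word.lower())
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


# Hypothetical wordpool and responses, for illustration only.
wordpool = ["zucchini", "bongos", "hippopotamus"]
print(autocorrect("bongo", wordpool))    # within distance 2 of "bongos"
print(autocorrect("giraffe", wordpool))  # no match within distance 2
```

Responses for which `autocorrect` returns `None` would, under this sketch, be the ones left for manual correction or discarded as non-matching.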
Naturalistic recall (Fig. 1b)
Participants watched a 2.5-min video clip entitled “The Temple of Knowledge.” The video comprises an animated story told to StoryCorps by Ronald Clark, who was interviewed by his daughter, Jamilah Clark. The narrator (Ronald) discusses growing up living in an apartment over the Washington Heights branch of the New York Public Library, where his father worked as a custodian during the 1940s.
After watching the video clip, participants were asked to type out anything they remembered about what happened in the video. They typed their responses into a text box, one sentence at a time. When the participant pressed the return key or typed any final punctuation mark (“.”, “!”, or “?”) the text currently entered into the box was “submitted” and added to their transcript, and the text box was cleared to prevent further editing of any already-submitted text. Participants were given up to 10 min to enter their responses. After 4 min, participants were given the option of ending the response period early, e.g., if they felt they had finished entering all the information they remembered. Each participant’s transcript was constructed from their submitted responses by combining the sentences into a single document and removing extraneous whitespace characters. Following this 4–10-min free response period, participants were given a series of 10 multiple choice questions about the conceptual content of the story. All participants received the same questions, in the same order. Participants also performed a final delayed memory test, where they carried out the free response recall task a second time, near the end of the testing session. This resulted in a second transcript for each participant.
Foreign language flashcards (Fig. 1c)
Participants studied a series of 10 English-Gaelic word pairs in a randomized order. We selected the Gaelic language both for its relatively small number of native speakers and for its dissimilarity to other commonly spoken languages amongst MTurk workers. We verified (via self report) that all of our participants were fluent in English and that they were neither fluent in nor familiar with Gaelic.
Each word’s “flashcard” comprised a cartoon depicting the given word, the English word or phrase in lowercase text (e.g., “the boy”), and the Gaelic word or phrase in uppercase text (e.g., “BUACHAILL”). Each flashcard was displayed for 4 s, followed by a 3 s interval (during which the screen was cleared) prior to the next flashcard presentation.
After studying all 10 flashcards, participants were given a multiple choice memory test where they were shown a series of novel photographs, each depicting one of the 10 words they had learned. They were asked to select which (of 4 unique options) Gaelic word went with the given picture. The 3 incorrect options were selected at random (with replacement across trials), and the orders in which the choices appeared to the participant were also randomized. Each of the 10 words they had learned was tested exactly once.
Participants also performed a final delayed memory test, where they were given a second set of 10 questions (again, one per word they had studied). For this second set of questions participants were prompted with a new set of novel photographs, and new randomly chosen incorrect choices for each question. Each of the 10 original words they had learned was (again) tested exactly once during this final memory test.
Spatial learning (Fig. 1d)
Participants performed a series of study-test trials where they memorized the onscreen spatial locations of two or more shapes. During the study phase of each trial, a set of shapes appeared on the screen for 10 s, followed by 2 s of blank (white) screen. During the test phase of each trial, the same shapes appeared onscreen again, but this time they were vertically aligned and sorted horizontally in a random order. Participants were instructed to drag (using the mouse) each shape to its studied position, and then to click a button to indicate that the placements were complete.
In different study-test trials, participants learned the locations of different numbers of shapes (always drawn from the same pool of 7 unique shapes, where each shape appeared at most one time per trial). They first performed three trials where they learned the locations of 2 shapes; next three trials where they learned the locations of 3 shapes; and so on until their last three trials, where (during each trial) they learned the locations of 7 shapes. All told, each participant performed 18 study-test trials of this spatial learning task (3 trials for each of 2, 3, 4, 5, 6, and 7 shapes).
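As an illustration of the trial structure described above, the following sketch builds one such schedule. It is not taken from the experiment code: the shape labels, the function name `build_spatial_trials`, and the use of unit-square coordinates for the study positions are all assumptions.

```python
import random

# Hypothetical labels standing in for the 7 unique shapes.
SHAPE_POOL = ["shape_%d" % k for k in range(1, 8)]


def build_spatial_trials(rng, trials_per_size=3, min_shapes=2, max_shapes=7):
    """Return a list of trials ordered by increasing number of shapes.

    Each trial samples its shapes without replacement from the pool (so each
    shape appears at most once per trial) and assigns each shape a random
    (x, y) study position. Positions in the unit square are an assumption.
    """
    trials = []
    for n_shapes in range(min_shapes, max_shapes + 1):
        for _ in range(trials_per_size):
            shapes = rng.sample(SHAPE_POOL, n_shapes)
            positions = {s: (rng.random(), rng.random()) for s in shapes}
            trials.append({"n_shapes": n_shapes, "positions": positions})
    return trials


trials = build_spatial_trials(random.Random(0))
print(len(trials))                       # 3 trials for each of 6 set sizes
print([t["n_shapes"] for t in trials])
```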
Fitness tracking using Fitbit devices
To gain access to our study, participants provided us with access to all data associated with their Fitbit account from the year (365 calendar days) up to and including the day they accepted the HIT. We filtered out all identifiable information (e.g., participant names, GPS coordinates, etc.) prior to importing their data.
Collecting and processing Fitbit data
The fitness tracking data associated with participants’ Fitbit accounts varied in scope and duration according to which device the participant owned (Fig. S1), how often the participant wore (and/or synced) their tracking device, and how long they had owned their device. For example, while all participants’ devices supported basic activity metrics such as daily step counts, only a subset of the devices with heart rate monitoring capabilities provided information about workout intensity, resting heart rate, and other related measures. Across all devices, we collected the following information: heart rate data, sleep tracking data, logged bodyweight measurements, logged nutrition measurements, Fitbit account and device settings, and activity metrics.
If available, we extracted all heart rate data collected by participants’ Fitbit device(s) and associated with their Fitbit profile. Depending on the specific device model(s) and settings, this included second-by-second, minute-by-minute, daily summary, weekly summary, and/or monthly summary heart rate information. These summaries include information about participants’ average heart rates, and the amount of time they were estimated to have spent in different “heart rate zones” (rest, out-of-range, fat burn, cardio, or peak, as defined by their Fitbit profile), as well as an estimate of the number of estimated calories burned while in each heart rate zone.
If available, we extracted all sleep data collected by participants’ Fitbit device(s). Depending on the specific device model(s) and settings, this included nightly estimates of the duration and quality of sleep, as well as the amount of time spent in each sleep stage (awake, REM, light, or deep).
If available, we extracted any weight-related information affiliated with participants’ Fitbit accounts within 1 year prior to enrolling in our study. Depending on their specific device model(s) and settings, this included their weight, body mass index, and/or body fat percentage.
If available, we extracted any nutrition-related information affiliated with participants’ Fitbit accounts within 1 year prior to enrolling in our study. Depending on their specific account settings and usage behaviors, this included a log of the specific foods they had eaten (and logged) over the past year, and the amount of water consumed (and logged) each day.
Account and device settings
We extracted any settings associated with participants’ Fitbit accounts to determine (a) which device(s) and model(s) were associated with their Fitbit account, (b) time(s) when their device(s) were last synced, and (c) battery level(s).
If available, we extracted any activity-related information affiliated with participants’ Fitbit accounts within 1 year prior to enrolling in our study. Depending on their specific device model(s) and settings, this included: daily step counts; daily amount of time spent in each activity level (sedentary, lightly active, fairly active, or very active, as defined by their account settings and preferences); daily number of floors climbed; daily elevation change; and daily total distance traveled.
Comparing recent versus baseline measurements.
We were interested in separating out potential associations between absolute fitness metrics and relative metrics. To this end, in addition to assessing potential raw (absolute) fitness metrics, we also defined a simple measure of recent changes in those metrics, relative to a baseline:
$$\begin{aligned} \Delta_{R,B}\, m = \frac{B \sum_{i=1}^{R} m(i)}{R \sum_{i=R+1}^{R+B} m(i)}, \end{aligned}$$
where $m(i)$ is the value of metric $m$ from $i - 1$ days prior to testing (e.g., $m(1)$ represents the value of $m$ on the day the participant accepted the HIT, and $m(10)$ represents the value of $m$ 9 days prior to accepting the HIT). We set $R = 7$ and $B = 30$. In other words, to estimate recent changes in any metric $m$, we divided the average value of $m$ taken over the prior week by the average value of $m$ taken over the 30 days before that.
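The ratio of the recent average to the baseline average can be sketched in a few lines. The function name `recent_change` is hypothetical, and the sketch assumes a complete daily series with no missing days, which the text above does not guarantee.

```python
def recent_change(m, R=7, B=30):
    """Ratio of the recent R-day average to the preceding B-day average.

    m is a list of daily values ordered from most recent to oldest, so that
    m[0] is the value on the day of testing (m(1) in the text). Assumes no
    missing days and a nonzero baseline sum.
    """
    if len(m) < R + B:
        raise ValueError("need at least R + B daily values")
    recent = m[:R]            # days 1..R
    baseline = m[R:R + B]     # days R+1..R+B
    return (B * sum(recent)) / (R * sum(baseline))


# Illustrative (made-up) step counts: a recent week at twice the baseline.
steps = [10000] * 7 + [5000] * 30
print(recent_change(steps))  # 2.0
```

Values above 1 indicate that the metric was higher over the most recent week than over the preceding 30 days, and values below 1 indicate the reverse.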
Exploratory correlation analyses
We used a bootstrap procedure to identify reliable correlations between different memory-related, fitness-related, and demographic-related variables. For each of $N = 10{,}000$ iterations, we selected (with replacement) a sample of 113 participants to include. This yielded, for each iteration, a sampled “data matrix” with one row per sampled participant and one column for each measured variable. When participants were sampled multiple times in a given iteration, as was often the case, this matrix contained duplicate rows. Next, we computed the Pearson correlation between each pair of columns. This yielded, for each pair of columns, a distribution of $N$ bootstrapped correlation coefficients. If 97.5% or fewer of the coefficients for a given pair of columns had the same sign, we excluded the pair from further analysis and considered the expected correlation between those columns to be undefined. If more than 97.5% of the coefficients for a given pair of columns had the same sign (corresponding to a bootstrap-estimated two-tailed p threshold of 0.05), we computed the expected correlation coefficient as:
$$\begin{aligned} \mathbb{E}_{i,j}\left[ r \right] = \tanh\left( \frac{1}{N} \sum_{n=1}^{N} \tanh^{-1}\left( \mathrm{corr}\left( m(i)_n, m(j)_n \right) \right) \right), \end{aligned}$$
where $m(x)_n$ represents column $x$ of the bootstrapped data matrix for iteration $n$, $\tanh$ is the hyperbolic tangent, and $\tanh^{-1}$ is the inverse hyperbolic tangent. We estimated the corresponding p-values for these correlations as one minus the proportion of bootstrapped correlations with the same sign, multiplied by two.
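For a single pair of variables, the bootstrap procedure might be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' analysis code: the function names are hypothetical, participants with missing values are assumed to have been removed beforehand, resamples with zero variance are skipped, and coefficients are clipped slightly inside (−1, 1) so that the inverse hyperbolic tangent stays finite.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlation(x, y, n_iter=10000, threshold=0.975, seed=0):
    """Return (expected_r, p), or (None, None) if the sign is not reliable."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # resample participants
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):                       # assumption: skip degenerate resamples
            rs.append(max(-0.999999, min(0.999999, r)))
    if not rs:
        return None, None
    n_pos = sum(r > 0 for r in rs)
    n_neg = sum(r < 0 for r in rs)
    same_sign = max(n_pos, n_neg) / len(rs)
    if same_sign <= threshold:
        return None, None                           # expected correlation undefined
    # Average in Fisher z-space, then map back to a correlation.
    expected_r = math.tanh(sum(math.atanh(r) for r in rs) / len(rs))
    p = 2 * (1 - same_sign)
    return expected_r, p
```

Averaging the z-transformed coefficients, rather than the raw coefficients, reduces the bias that arises because the sampling distribution of a correlation is skewed near ±1.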
Reverse correlation analyses
We sought to characterize potential associations between the dynamics of participants’ fitness-related activities leading up to the time they participated in a memory task and their performance on the given task. For each fitness-related variable, we constructed a timeseries matrix whose rows corresponded to timepoints (sampled once per day) leading up to the day the participant accepted the HIT for our study, and whose columns corresponded to different participants. These matrices often contained missing entries, since different participants’ Fitbit devices tracked fitness-related activities differently. For example, participants whose Fitbit devices lacked heart rate sensors would have missing entries for any heart rate-related variables. Or, if a given participant neglected to wear their fitness tracker on a particular day, the column corresponding to that participant would have missing entries for that day. To create stable estimates, we smoothed the timeseries of each fitness measure using a sliding window of 1 week. In other words, for each fitness measure, we replaced the “observed value” for each day with the average value of that measure (when available) over the 7-day interval ending on the given day.
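The smoothing step for one participant and one fitness measure can be sketched as below. The function name `smooth_trailing` is hypothetical, missing days are represented as `None`, and two choices are assumptions not stated in the text: windows near the start of the series are shortened rather than dropped, and a day whose entire window is missing stays missing.

```python
def smooth_trailing(values, window=7):
    """Trailing moving average that ignores missing (None) entries.

    values holds one participant's daily observations in chronological order.
    Each output is the mean of the available observations in the `window`
    days ending on that day.
    """
    smoothed = []
    for day in range(len(values)):
        start = max(0, day - window + 1)   # assumption: shorter window at the start
        observed = [v for v in values[start:day + 1] if v is not None]
        smoothed.append(sum(observed) / len(observed) if observed else None)
    return smoothed


# Made-up daily step counts with one unworn day.
print(smooth_trailing([8000, None, 10000], window=2))  # [8000.0, 8000.0, 10000.0]
```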
In addition to this set of matrices storing timeseries data for each fitness-related variable, we also constructed a memory performance matrix, M, whose rows corresponded to different memory-related variables, and whose columns corresponded to different participants. For example, one row of the memory performance matrix reflected the average proportion of words (across lists) that each participant remembered during the immediate free recall test, and so on.
Given a fitness timeseries matrix, F, we computed the weighted average and weighted standard error of the mean of each row of F, where the weights were given by a particular memory-related variable (row of M). For example, if F contained participants’ daily step counts, we could use any row of M to compute a weighted average across any participants who contributed step count data on each day. Choosing a row of M that corresponded to participants’ performance on the naturalistic recall task would mean that participants who performed better on the naturalistic recall task would contribute more to the weighted average timeseries of daily step counts. Specifically, for each row, t, of F, we computed the weighted average (across the S participants) as: