Ch14

Strings and regular expressions.

This chapter is incredibly useful. I recommend that you take notes, if that helps you learn. At the very least read the chapter and the stringr vignette. The tools of regular expressions are available in most languages, including Python and Java.

Below is a copy-and-paste of all of the good problems in Wickham’s R4DS, Chapter 14.

Section 2: String Basics

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
  2. (in class) In your own words, describe the difference between the sep and collapse arguments to str_c().
  3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
  4. (skip) What does str_wrap() do? When might you want to use it?
  5. What does str_trim() do? Can you imagine how it might be useful when working with data input by people?
  6. (Advanced/bonus.) Write a function that turns (e.g.) a vector c(“a”, “b”, “c”) into the string “a, b, and c”. Think carefully about what it should do if given a vector of length 0, 1, or 2. For example, c(“a”,“b”) should turn into “a and b”.

Section 3: Matching

3.1

  1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".
  2. How would you match the sequence "'\?
  3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

3.2

  1. How would you match the literal string "$^$"?
  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
    1. Start with “y”.
    2. End with “x”
    3. Are exactly three letters long. (Don’t cheat by using str_length()!)
    4. Have seven letters or more.

Note: Since the list of words is long, you might want to use the str_subset to show only the matching words.

3.3

  1. Create regular expressions to find all words that:
    1. Start with a vowel.
    2. That only contain consonants. (Hint: thinking about matching “not”-vowels.)
    3. End with ed, but not with eed.
    4. End with ing or ise.
  2. Empirically verify the rule “i before e except after c”.
  3. Is “q” always followed by a “u”?
  4. Write a regular expression that matches a word if it’s probably written in British English, not American English.
  5. (Discussed in class.) Create a regular expression that will match telephone numbers as commonly written in your country.

3.4

  1. Describe the equivalents of ?, +, * in {m,n} form.
  2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.) 1. ^.*$ 2. "\\{.+\\}" 3. \d{4}-\d{2}-\d{2} 4. "\\\\{4}"
  3. Create regular expressions to find all words that:
    1. Start with three consonants.
    2. Have three or more vowels in a row.
    3. Have two or more vowel-consonant pairs in a row.
  4. (Bonus) Solve a beginner regexp crosswords. You will want to read how to play.

3.5

  1. Describe, in words, what these expressions will match:
    1. (.)\1\1
    2. "(.)(.)\\2\\1"
    3. (..)\1
    4. "(.).\\1.\\1"
    5. "(.)(.)(.).*\\3\\2\\1"
  2. Construct regular expressions to match words that:
    1. Start and end with the same character.
    2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
    3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Section 4: Tools

4.1

  1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

    1. Find all words that start or end with x.
    2. Find all words that start with a vowel and end with a consonant.
    3. Are there any words that contain at least one of each different vowel?
  2. (a) What word has the highest number of vowels? (b) What word has the highest proportion of vowels?

4.2

  1. You might notice that the regular expression for colors (red|orange|yellow|...) matched “flickered”, which is not a colour. Modify the regex to fix the problem.
  2. From the Harvard sentences data (available once you run library(tidyverse)), extract:
    1. The first word from each sentence.
    2. All words ending in ing.
    3. All plurals.

4.3

  1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word. (Be sensible, put some limit on the numbers you recognize.)
  2. Find all contractions. Separate out the pieces before and after the apostrophe.

4.4

  1. Replace all forward slashes in a string with backslashes.
  2. Implement a simple version of str_to_lower() using replace_all().
  3. (a) Switch the first and last letters in words. (b) (Harder?) Which of those strings are still words?

4.5

  1. Split up a string like "apples, pears, and bananas" into individual components.
  2. Why is it better to split up by boundary("word") than " "?
  3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.

4.6 Advanced / Optional

  1. How would you find all strings containing \ with regex() vs. with fixed()?
  2. What are the five most common words in sentences?
Last modified August 18, 2023: 2022-2023 End State (7352e87)