Strings
Strings are an important concept for any programmer to master. Rust’s string handling system is a bit different from other languages, due to its systems focus. Any time you have a data structure of variable size, things can get tricky, and strings are a resizable data structure. That being said, Rust’s strings also work differently than in some other systems languages, such as C.
Let’s dig into the details. A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences. Additionally, unlike some systems languages, strings are not NUL-terminated and can contain NUL bytes.
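As a minimal sketch of these two guarantees: an interior NUL is just another byte, and bytes that are not valid UTF-8 are rejected when you try to view them as a string. The helper names byte_len and is_valid_utf8 are made up for illustration; only str::len and std::str::from_utf8 come from the standard library.

```rust
/// Length of a string in bytes; a NUL byte counts like any other byte.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Whether a byte sequence is valid UTF-8 and so could be viewed as a string.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    let with_nul = "ab\0cd";
    // The NUL in the middle does not end the string.
    println!("{}", byte_len(with_nul));
    // 0xff never appears in well-formed UTF-8, so this is rejected.
    println!("{}", is_valid_utf8(&[0xff, 0xfe]));
}
```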
Rust has two main types of strings: &str and String. Let’s talk about &str first. These are called ‘string slices’. A string slice has a fixed size, and cannot be mutated. It is a reference to a sequence of UTF-8 bytes.
let greeting = "Hello there."; // greeting: &'static str
"Hello there." is a string literal and its type is &'static str. A string literal is a string slice that is statically allocated, meaning that it’s saved inside our compiled program, and exists for the entire duration it runs. The greeting binding is a reference to this statically allocated string. Any function expecting a string slice will also accept a string literal.
String literals can span multiple lines. There are two forms. The first will include the newline and the leading spaces:
let s = "foo
    bar";

assert_eq!("foo\n    bar", s);
The second, with a \, trims the spaces and the newline:
let s = "foo\
    bar";

assert_eq!("foobar", s);
Note that you normally cannot access a str directly, but only through a &str reference. This is because str is an unsized type which requires additional runtime information to be usable. For more information see the chapter on unsized types.
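One way to see that ‘additional runtime information’ is to look at sizes. This is a rough sketch: the helper str_ref_size_in_words is hypothetical, while size_of and size_of_val are from std::mem. A &str carries a pointer plus a length, so it is two machine words wide, and the length of the underlying str is only known at runtime.

```rust
use std::mem::{size_of, size_of_val};

/// Size of a `&str` measured in machine words (pointer + length).
fn str_ref_size_in_words() -> usize {
    size_of::<&str>() / size_of::<usize>()
}

fn main() {
    let greeting: &str = "Hello there.";
    println!("{}", str_ref_size_in_words());
    // Size in bytes of the `str` that `greeting` refers to.
    println!("{}", size_of_val(greeting));
}
```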
Rust has more than just &strs, though. A String is a heap-allocated string. This string is growable, and is also guaranteed to be UTF-8. Strings are commonly created by converting from a string slice using the to_string method.
let mut s = "Hello".to_string(); // mut s: String
println!("{}", s);

s.push_str(", world.");
println!("{}", s);
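The to_string method is not the only route to a String. The sketch below lists a few standard conversions that should all produce the same value, and builds one up from empty; make_strings is a hypothetical helper name used only for this illustration.

```rust
/// Builds the same `String` in several equivalent ways.
fn make_strings() -> Vec<String> {
    vec![
        "Hello".to_string(),
        String::from("Hello"),
        "Hello".to_owned(),
        "Hello".into(),
    ]
}

fn main() {
    let mut s = String::new(); // starts out empty
    s.push_str("Hello"); // append a string slice
    s.push('!'); // append a single `char`
    println!("{} ({} bytes)", s, s.len());

    for made in make_strings() {
        println!("{}", made);
    }
}
```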
Strings will coerce into &str with an &:
fn takes_slice(slice: &str) {
    println!("Got: {}", slice);
}

fn main() {
    let s = "Hello".to_string();
    takes_slice(&s);
}
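This coercion does not happen for functions that accept one of &str’s traits instead of &str, because a generic parameter gives the compiler no concrete &str type to coerce to. For example, TcpStream::connect has a parameter whose type is bounded by the ToSocketAddrs trait. A &str is accepted directly, while a String can be explicitly converted to a &str using &*.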
use std::net::TcpStream;

TcpStream::connect("192.168.0.1:3000"); // Parameter is of type &str.

let addr_string = "192.168.0.1:3000".to_string();
TcpStream::connect(&*addr_string); // Convert `addr_string` to &str.
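To try the same conversion without opening a connection, here is a small sketch with a hypothetical helper, first_addr, that takes the same kind of trait-bounded parameter and only resolves the address. ToSocketAddrs and SocketAddr are real items in std::net; everything else is illustrative.

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: accepts anything implementing `ToSocketAddrs`
/// and returns the first address it resolves to, if any.
fn first_addr<A: ToSocketAddrs>(addr: A) -> Option<SocketAddr> {
    addr.to_socket_addrs().ok()?.next()
}

fn main() {
    // A string slice satisfies the trait bound as-is.
    println!("{:?}", first_addr("192.168.0.1:3000"));

    // `&*` turns the `String` into a `&str` explicitly.
    let addr_string = "192.168.0.1:3000".to_string();
    println!("{:?}", first_addr(&*addr_string));
}
```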
Viewing a String as a &str is cheap, but converting the &str to a String involves allocating memory. No reason to do that unless you have to!
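The difference can be observed by comparing pointers. In this sketch (both function names are invented for the example), a view of a String points at the String’s own buffer, while to_string on a non-empty slice ends up with a separate buffer holding a copy.

```rust
/// True if viewing `owned` as a slice points at the same bytes (no copy).
fn view_shares_buffer(owned: &String) -> bool {
    let view: &str = owned; // coercion only, nothing is allocated
    view.as_ptr() == owned.as_ptr()
}

/// True if converting `slice` to a `String` produced a separate buffer.
fn conversion_copies(slice: &str) -> bool {
    let owned = slice.to_string(); // allocates and copies the bytes
    owned.as_ptr() != slice.as_ptr()
}

fn main() {
    let s = "Hello".to_string();
    println!("{}", view_shares_buffer(&s));
    println!("{}", conversion_copies("Hello"));
}
```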
Indexing
Because strings are valid UTF-8, they do not support indexing:
let s = "hello";
println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading. Furthermore, ‘letter’ isn’t something defined in Unicode, exactly. We can choose to look at a string as individual bytes, or as codepoints:
let hachiko = "忠犬ハチ公";

for b in hachiko.as_bytes() {
    print!("{}, ", b);
}

println!("");

for c in hachiko.chars() {
    print!("{}, ", c);
}

println!("");
This prints:
229, 191, 160, 231, 138, 172, 227, 131, 143, 227, 131, 129, 229, 133, 172,
忠, 犬, ハ, チ, 公,
As you can see, there are more bytes than chars.
You can get something similar to an index like this:
let hachiko = "忠犬ハチ公";
let dog = hachiko.chars().nth(1); // Kinda like `hachiko[1]`.
This emphasizes that we have to walk from the beginning of the list of chars.
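If you also need to know where a char sits in the underlying bytes, char_indices pairs each char with its byte offset. A small sketch, with nth_char and byte_offset_of as hypothetical helper names:

```rust
/// Returns the `n`th `char` of `s` (zero-based) by walking from the start.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Returns the byte offset at which the `n`th `char` of `s` starts.
fn byte_offset_of(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    let hachiko = "忠犬ハチ公";
    println!("{} bytes, {} chars", hachiko.len(), hachiko.chars().count());
    println!("{:?}", nth_char(hachiko, 1));
    println!("{:?}", byte_offset_of(hachiko, 1));
}
```

Both helpers return None when n is past the end of the string, rather than panicking.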
Slicing
You can get a slice of a string with the slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '忠'
(bytes 0..3) of `忠犬ハチ公`'
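When the offsets are not known to be valid, one option is to avoid the panic altogether. The sketch below uses str::get, which returns None for a range that does not fall on char boundaries, and a hypothetical helper, first_chars, that derives a valid byte offset from a character count.

```rust
/// Returns the first `n` chars of `s` as a slice, or all of `s` if it is shorter.
fn first_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `end` is where the `n`th char starts, so it is a char boundary.
        Some((end, _)) => &s[..end],
        None => s,
    }
}

fn main() {
    let dog = "忠犬ハチ公";
    println!("{:?}", dog.get(0..2)); // byte 2 is inside '忠'
    println!("{:?}", dog.get(0..3)); // bytes 0..3 are exactly '忠'
    println!("{}", dog.is_char_boundary(2));
    println!("{}", first_chars(dog, 2));
}
```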
Concatenation
If you have a String, you can concatenate a &str to the end of it:
let hello = "Hello ".to_string();
let world = "world!";

let hello_world = hello + world;
But if you have two Strings, you need an &:
let hello = "Hello ".to_string();
let world = "world!".to_string();

let hello_world = hello + &world;
This is because &String can automatically coerce to a &str. This is a feature called ‘Deref coercions’.
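Note that + takes the left-hand String by value, so hello cannot be used afterwards. If you want to keep both inputs, or append in place, format! and push_str are two alternatives. The function names in this sketch are hypothetical.

```rust
/// Joins two slices with `format!`, leaving both inputs untouched.
fn join_with_format(a: &str, b: &str) -> String {
    format!("{}{}", a, b)
}

/// Appends `b` to `a` in place with `push_str` and hands `a` back.
fn join_in_place(mut a: String, b: &str) -> String {
    a.push_str(b);
    a
}

fn main() {
    let hello = "Hello ".to_string();
    let world = "world!".to_string();

    let joined = join_with_format(&hello, &world);
    println!("{}", joined);
    println!("{}", hello); // still usable: it was only borrowed

    let hello_world = hello + &world; // `hello` is moved here
    println!("{}", join_in_place(hello_world, " Again."));
}
```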