Arc Forumnew | comments | leaders | submitlogin
List of characters vs string
5 points by d0m 5174 days ago | 12 comments
Simply put, why should we have strings instead of a simple list of character?

From my experience with Arc, strings are a bit weird as sometime they work with the core functions while sometime they don't. Sometime we need to use apply str on them, sometime we don't.

What I propose instead is for the printer to show a list of characters as a "string". A little bit as we show '(1 2 3) instead of (1 . (2 . (3... Just to be clear, (eval "test") would really return a list of character.. but the printer would print it as "test".

The advantage would be to be able to use all list functions on "string" without the burden of trying to guess when it'd work and when it wouldn't.

Also, it would make Arc even more minimalist.

The reader could also automatically convert "test" to a list of character.. so that:

(map whatever "test") would work as expected.

I really enjoy reading your answers that point out where I'm wrong, don't hesitate to post :-)



3 points by bogomipz 5174 days ago | link

The arguments against strings as lists are that they're going to be incredibly slow and take up many times more memory. Characters will in practice be cons cells, which means they take up something like 8 bytes each, are allocated individually, and generate lots of work for the garbage collector. As others have mentioned, accessing the nth element of a linked list is very slow, and even iterating the list is slower than for arrays.

I would much rather prefer string to be a subtype of list, with the functions that actually access the elements dispatching on the type. This would have all the benefits without the performance loss.

The missing bits to enable this in a general way are transparent type/annotate, generic functions, and hierarcical types. By transparent I mean that annotate should not return a cons cell with the original value inside. The runtime should have support for marking each value with a type.

-----

1 point by aw 5174 days ago | link

That's very interesting. I don't think I've heard of the idea of having the printer display a list of characters as a string before.

-----

3 points by fallintothis 5174 days ago | link

I thought it would've been kind of implicit in the perennial "hey, let's make strings lists" idea. Which isn't new, by the way.

Haskell

  Prelude> :type "abc"
  "abc" :: [Char]
Erlang

  1> A="abc".
  "abc"
  2> is_list(A).
  true
Prolog

  ?- is_list("abc").
  true.
What else do all these strings-are-lists languages have in common? They're notoriously slow for string processing. Linked lists use up more memory and simple accesses are O(n) rather than O(1). It's to the point that the Haskell guys wrote the ByteString library to override the standard Prelude functions and added a GHC extension to parse "abc" as a ByteString instead of [Char].

Arc itself has deviated on this point (from http://paulgraham.com/arcll1.html). Comments in arc.arc include:

  ; compromises in this implementation: 
  ...
  ; separate string type
  ;  (= (cdr (cdr str)) "foo") couldn't work because no way to get str tail
  ;  not sure this is a mistake; strings may be subtly different from 
  ;  lists of chars

  ; idea: get rid of strings and just use symbols
  ; could a string be (#\a #\b . "") ?
PicoLisp uses the strings-are-symbols approach (http://software-lab.de/doc/faq.html#strings). And that illustrates that it's all a matter of focus. PicoLisp is dynamic, perhaps to a fault. By design, it has no arrays, strings, lambda keyword, or even a compiler. But if you have any investment in efficiency, proper strings seem smarter, if only as a library for when you need the performance (as in Haskell). Never mind arguable conceptual differences.

Arc awkwardly straddles this divide. It has (a) hash tables instead of alists, but (b) lists instead of arrays, (c) a "compiler" that's mostly to keep code from being prohibitively slow, and (d) proper strings, but everyone says they want lists.

It seems to me that the people who want strings == lists are mostly looking for a Grand Unified Theory of Sequences. The idea is that if you don't want the standard lib to be a mess of

  (case (type x)
    cons
      ...
    string
      (coerce 'string (... (coerce x 'cons)))
    god-forbid-we-add-another-sequence-type
      (coerce ...))
you'll just make everything lists, and to hell with the distinction. Everything's the same, so it's simpler!

You could tackle the problem from the other end. Other languages invest in abstractions to write code that sanely deals with different data types. What sort of abstractions do we get out of this? Often some form of object orientation, be it message-passing or generic functions or what-have-you. It starts looking pretty enticing: man, we could make everything an object, and to hell with the distinction. Everything's the same, so it's simpler!

Hmm...

-----

4 points by d0m 5174 days ago | link

Do you think it's a bad thing to have a specialized library for optimized string processing? In the big majority of cases, I personally don't care at all about the performance of my string processing. Would the needs arrive, I'd be perfectly willing to use another library specialized for that (Which I would accept that all the high level function won't work for optimization reasons).

Also, as you say, a couple of (case) would be needed in the standard library. But again, I hardly see why it is a bad thing? (Maybe multi-methods would be a better approach but that's another debate). My point is that I don't really care what the standard lib code looks like.. the important part is the end language result.. no?

Finally, I get your point about the "Grand Unified Theory of Sequences". However, this list of chars suggestion is more about making the code behaviour simpler.. not that much about trying to be "minimalist to be minimalist".

For instance, what (map bleh "test") should do? What should it return? If I know that "test" is simply a list of character, I know for sure what would happen (bleh would be applied on a character and map would returns a list of whatever the function returns). Now, what does it currently do? Maybe the same thing, maybe not. (See that post: http://arclanguage.org/item?id=12311) If my function returns a string instead of a character, it crashes.. since when does map expect something special? I.e. (map ... (range 1 10)) doesn't expect me to return a list of integer.

-----

2 points by fallintothis 5174 days ago | link

Do you think it's a bad thing to have a specialized library for optimized string processing?

Not in principle. Especially if it was totally transparent. This isn't the case in Haskell, because you wind up importing ByteString functions qualified with a prefix like B, so your code is littered with B.map this and B.length that.

But then, if it's totally transparent (i.e., same behavior & syntax between the inefficient one and the efficient one), what was the point of the separation to begin with? Just use the efficient one.

Also, as you say, a couple of (case) would be needed in the standard library.

Actually, no. I was saying that treating strings as lists would make code shorter. E.g., map has to check for strings manually right now.

  (def map (f . seqs)
    (if (some [isa _ 'string] seqs)
         (withs (n   (apply min (map len seqs))
                 new (newstring n))
           ((afn (i)
              (if (is i n)
                  new
                  (do (sref new (apply f (map [_ i] seqs)) i)
                      (self (+ i 1)))))
            0))
        (no (cdr seqs))
         (map1 f (car seqs))
        ((afn (seqs)
          (if (some no seqs)
              nil
              (cons (apply f (map1 car seqs))
                    (self (map1 cdr seqs)))))
         seqs)))
But if strings were built out of cons, we'd be able to prune out that first branch.

  (def map (f . seqs)
    (if (no (cdr seqs))
        (map1 f (car seqs))
        ((afn (seqs)
           (if (some no seqs)
               nil
               (cons (apply f (map1 car seqs))
                     (self (map1 cdr seqs)))))
         seqs)))
(On that note, would there be a difference between "" and nil?)

Finally, I get your point about the "Grand Unified Theory of Sequences". However, this list of chars suggestion is more about making the code behaviour simpler.. not that much about trying to be "minimalist to be minimalist".

My point was that fusing lists and strings is a popular idea since Arc doesn't really have good facilities for polymorphism. You currently need to do the (case ...) junk (or macroexpand into it or whatever), since there are sequences with their subtle interface differences. Strings-are-lists would get rid of this (until we needed another sequence type...). Object-oriented languages "do polymorphism" better (e.g., with multimethods), but breed a fervor for all-things-OO similar to the just-make-everything-a-list sentiment, even though both are extremes.

I don't really care what the standard lib code looks like.. the important part is the end language result.. no?

Right now Arc sits in the middle, without a unified sequence interface -- be it making everything a list or having some other abstraction -- so we all suffer for it, both in the standard library and in user code. Here's a fun one:

  arc> (count #\a '(#\a #\b #\c))
  1
  arc> (counts '(#\a #\b #\c))
  #hash((#\b . 1) (#\c . 1) (#\a . 1))
  arc> (count #\a "abc")
  1
  arc> (counts "abc")
  Error: "Can't take car of \"abc\""
The difference being count uses each, which is polymorphic (i.e., is written essentially with a big (case ...)), but counts hard-codes recursion on cdrs.

See that post: http://arclanguage.org/item?id=12311

Not to get too hung up on this example -- I can appreciate the simplicity of the strings-are-lists model where map is concerned -- but...

You could argue that map's current behavior is sane. There's a subtle difference between expected behaviors, depending on how you want to interpret map. Currently, map says the output sequence should be of the same type as the input. As such, you might expect an error from

  (map [+ "abc" _] "def")
if it was to be understood roughly as

  (map-as 'string [+ "abc" _] "def")
since strings aren't really lists -- they don't have arbitrary elements. (Using the map-as from my post wouldn't actually throw an error here. coerce special-cases strings and flattens them. Yet another subtlety in expectations: should map's "same type" behavior abide by coerce's rules?)

This tangentially relates back to the contagion point in http://paulgraham.com/arcll1.html -- how to shift between strings and lists depending on what sorts of elements are added.

-----

2 points by d0m 5173 days ago | link

(Note: I don't by any means want to be aggressive. I'm still a beginner in the lisp world.. so everything I say here as a big: "Me being a real newbie thinks: " in front :) )

"But then, if it's totally transparent (i.e., same behavior & syntax between the inefficient one and the efficient one), what was the point of the separation to begin with? Just use the efficient one."

Maybe I wasn't clear.. but I never said it needed to be transparent. If I need my strings to be highly optimized, I'd much prefer use a specialized string library and use, say:

  (let optimized-string (create-optimized-string (get-100000-chars))
   (optimized-string-map blah optimized-string))

or maybe a:

  (with/optimized-string
     (let bleh (get-100000-chars)
       (map blah bleh)))
(which would take care of rebinding core high level functions)

-----

Also, I'm not sure I agree with you on the "(until we needed another sequence type...)." Why another sequence type? Any sequences could be coerce to simple list which work with the core functions. Use map-hash, or map-file, or map-whatever if you need specialized version for other sequences. (Or as I said in the last example)

-----

Finally, again (sorry), I don't really agree on: "map says the output sequence should be of the same type as the input."

Is it true?

  (map [string "test" _] '(1 2 3)) -> ("test1" "test2" "test3")
  (map [string "test" _] "123")) -> err
It's only for string that the output needs to be the input. (Which is weird at best in my opinion).

-----

1 point by fallintothis 5173 days ago | link

I don't by any means want to be aggressive.

Neither do I. Just trying to be terse (not that I am anyways; I have a problem with conciseness). :)

I never said it needed to be transparent

No, but I was. In my limited experience, using ByteStrings in Haskell is still too much work, as opposed to having the standard Prelude functions work "out of the box". Instead of being able to say

  map (\x -> 'x') "abc"
you wind up needing to

  import qualified Data.Bytestring as B

  B.map (\x -> 'x') (B.pack "abc")
When you need to start changing every string-related function, the reader, the writer, blah blah blah, it gets to be a hassle. Perhaps it's less of a pain in dynamically-typed languages like Arc. I don't know.

Why another sequence type? Any sequences could be coerce to simple list which work with the core functions.

Not everything is a linked list. And forcefully coercing every data structure into a linked list wrecks the time & space complexities that give many data structures their purpose. You'd force ranges (virtual sequences, a la Python) to eat up memory, though you certainly want to (say) map over them. You force arrays to have O(n) random access times. You couldn't have immutable sequences because linked lists are mutable. Etc.

Use map-hash, or map-file, or map-whatever if you need specialized version for other sequences. (Or as I said in the last example)

Frankly, ad-hoc polymorphism is ugly. Take Scheme, for instance, whose (R5RS) standard comparison operators include:

  char=? char<? char>? char<=? char>=? string=? string<? string>? string<=?
  string>=?  = < > <= >=
where overloaded comparisons would do.

I don't really agree on: "map says the output sequence should be of the same type as the input."

  arc> (type '(1 2 3))
  cons
  arc> (type (map [string "test" _] '(1 2 3)))
  cons
  arc> (type "123")
  string
  arc> (type (map inc "123"))
  string

-----

1 point by d0m 5173 days ago | link

"When you need to start changing every string-related function, the reader, the writer, blah blah blah, it gets to be a hassle. Perhaps it's less of a pain in dynamically-typed languages like Arc. I don't know."

I also don't know :) Maybe advanced arc users could share their opinion on that?

"Not everything is a linked list. And forcefully coercing every data structure into a linked list wrecks the time & space complexities that give many data structures their purpose."

Yes, this is true.

In fact, it's a decision that needs to be taken.. should the core high level function work against different data structure? (As it does with clojure?)

As for now, map does work with string (but has a weird behavior in my opinion) and doesn't work with hash. Is this what we want?

Also, what do you think about multimethods? Do you think it's something useful to be added in Arc?

---------

About the map input versus output, I get what you mean. However, (map [string "test" _] "123") should work :-/ Maybe the problem lies in the concatenation operator while constructing the new string. i.e.

  (map [coerce _ 'int] "123") could give "116101115116"
 
(I know it's not a good example... however, look at this one) :

  (map [string "(" _ ")"] "test") -> Shouldn't it return "(t)(e)(s)(t)" ?!

-----

2 points by fallintothis 5172 days ago | link

should the core high level function work against different data structure?

The answer seems to be a resounding yes. Polymorphism was one of Arc's main principles when it was 3 weeks old (http://paulgraham.com/arcll1.html), and one that's been more-or-less preserved to this day -- just clunkily.

doesn't work with hash

I don't care for maptable in Arc. Seems like something that should be done by map. I address the point about hash-tables-as-sequences more in http://arclanguage.org/item?id=12341.

Also, what do you think about multimethods? Do you think it's something useful to be added in Arc?

From what I've seen, generic functions (single dispatch or multimethods) seem a "Lisp-y" way of solving the type-dispatch problem, and you don't really need to go full-blown OO about it. But I don't know much about other options.

However, (map [string "test" _] "123") should work :-/

Oh, I agree. I was just saying, the case could be made. :P

I think my map-as (from your original thread: http://arclanguage.org/item?id=12341) reflects map's intended behavior best, since coerce is really our standard for how types should interact. Not that it should be implemented that way, necessarily. But since coerce will flatten strings, it seems fair to say that

  (map [string "test" _] "123")
and

  (coerce (map1 [string "test" _] (coerce "123" 'cons)) 'string)
should return the same thing. At present, map errors, but

  arc> (coerce (map1 [string "test" _] (coerce "123" 'cons)) 'string)
  "test1test2test3"
The map-as approach would still preserve the input type == output type behavior, but do it by coerce's rules, which happen to play nicely with strings here.

-----

1 point by akkartik 5171 days ago | link

I've been thinking about the semantics of maptable. Right now it seems misnamed; it iterates over the table before returning it unmodified. But what should the right semantics be? Should it return a table with modified values for the same set of keys? Or should it return a list? Or should each iteration return a (k v) pair so you can get a new table with entirely new keys (I think map-as does this)? All of these could be useful; I think we need a more elaborate language than just map/fold to describe them.

-----

2 points by rocketnia 5172 days ago | link

(map [string "(" _ ")"] "test")

The function [string "(" _ ")"] returns strings, so if anything, the result of that expression should be a sequence of strings, not a string itself.

Nevertheless, maybe (mappend [string "(" _ ")"] "test") should do what you're thinking.

-----

1 point by akkartik 5174 days ago | link

I think this one's a winner :) The word 'minimalist' gives me the warm fuzzies.

-----